Combine or Sum rows based on partial match and other rules - r

I have a dataframe df1:
df1 <- data.frame(
Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B"),
Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19"),
Step = c("A","A","B","B","C","C","C"),
kg = c(31,32,14,16,10,11,10))
Sometimes at a particular 'Step' a 'Lot' gets split into A,B or C as indicated. I'd like to sum those and get a dataframe that tells me the total kg at each step, for each lot.
For example the output should look like this:
df2 <- data.frame(
Lot = c("13VC011","13VC011","13VC018","13VC018"),
Step = c("A","B","A","C"),
kg = c(31,30,32,31))
So there are two requirements: if the 'Lot' matches, regardless of the trailing letter, and the 'Step' matches, then the rows are summed. If the conditions are not both satisfied, the line item is simply carried over into df2 as is.
Part 2:
So I would like to introduce a third requirement. In some cases a Lot was split into two or three parts, but not all of the data is present. The solutions above mask this and make it appear that a lot has much lower kg than it actually does.
What I would like is a way to flag cases where the dataset contains, for example, 13VC011A but no 13VC011B, or a 'B' but no 'A', or a 'C' but no 'B' or 'A'.
So now the original dataframe is:
df1 <- data.frame(
Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B","13VC020B"),
Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19","2013-07-22"),
Step = c("A","A","B","B","C","C","C","B"),
kg = c(31,32,14,16,10,11,10,18))
And the resultant df2 should look something like:
df2 <- data.frame(
Lot = c("13VC011","13VC011","13VC018","13VC018","13VC020B"),
Step = c("A","B","A","C","B"),
kg = c(31,30,32,31,18),
Partial = c(F,F,F,F,T))

df1$Lot <- gsub("[[:alpha:]]$","",df1$Lot) # strip the trailing letter from the end of the Lot string
aggregate(kg~Lot+Step,df1, FUN=sum)
# Lot Step kg
#1 13VC011 A 31
#2 13VC011 B 30
#3 13VC018 A 32
#4 13VC018 C 31
Or using dplyr
library(stringr)
library(dplyr)
df1%>%
group_by(Lot=str_extract(Lot,perl('.*\\d(?=[A-Z]?$)')), Step) %>%
summarize(kg=sum(kg))
#Source: local data frame [4 x 3]
#Groups: Lot
# Lot Step kg
#1 13VC011 A 31
#2 13VC011 B 30
#3 13VC018 A 32
#4 13VC018 C 31
Explanation
regex
.* : matches any characters (greedy)
\\d : up to a final digit
(?=[A-Z]?$) : with a lookahead requiring at most one optional uppercase letter before the end of the string
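For example, a quick illustration of the pattern (note that newer versions of stringr accept the pattern directly, without wrapping it in perl()):
library(stringr)
str_extract(c("13VC011", "13VC011A", "13VC018C"), '.*\\d(?=[A-Z]?$)')
# [1] "13VC011" "13VC011" "13VC018"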

> aggregate(kg ~Lot + Step, data=df1, FUN=sum)
Lot Step kg
1 13VC011 A 31
2 13VC018 A 32
3 13VC011A B 14
4 13VC011B B 16
5 13VC018A C 10
6 13VC018B C 10
7 13VC018C C 11
At that point I finally understood what you meant by "regardless of the trailing letter" and wondered if the formula method of aggregate could handle an R-function in one of the terms:
> aggregate(kg ~substr(Lot,1,7) + Step, data=df1, FUN=sum)
substr(Lot, 1, 7) Step kg
1 13VC011 A 31
2 13VC018 A 32
3 13VC011 B 30
4 13VC018 C 31
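Neither answer above handles Part 2. Here is a minimal dplyr sketch for the Partial flag, assuming split lots are always suffixed with consecutive capital letters starting at 'A' and that a split lot never shares a Step with its unsplit parent. It reports the stripped lot name plus a Partial flag; the expected df2 keeps the suffix for the lone partial row, which you would have to restore separately:
library(dplyr)
df1 %>%
  mutate(Lot    = as.character(Lot),               # Lot may be a factor; make it character
         base   = sub("[A-Z]$", "", Lot),          # lot name without the split suffix
         suffix = ifelse(grepl("[A-Z]$", Lot),
                         substring(Lot, nchar(Lot)), NA_character_)) %>%
  group_by(Lot = base, Step) %>%
  summarize(kg = sum(kg),
            Partial = any(!is.na(suffix)) &&
                      !identical(sort(suffix), LETTERS[seq_along(suffix)]))
A group is flagged when it has split suffixes that are not a consecutive run starting at "A" (e.g. a lone "B"), which matches the expected output for 13VC020B.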

Related

Replace value from updated dataset based on number of instances it appears in a second dataset

I have a simple 2-column dataset containing the variables cluster_size and index. Originally all values of index were assigned a value of 1. Subsequently, I received a second dataset containing only a few clusters where index should be updated with different integer values.
I simply need to replace the index value from the updated dataset. My specific issue is that the values for cluster_size can repeat multiple times, but I only need to replace it for the number of instances it appears in the updated dataset. For instance, in the example data below, the cluster_size value of 34 appears three times, but only once in the updated data with an index of 6. This means that only one of these three rows should update to 6 (doesn't matter which one).
Code to recreate a 20-row sample of the original data (have), the updated subset (updated), and the desired dataset (want) is below. The actual data has tens of thousands of rows. I've tried several merge and loop functions (all too pathetic to waste your time by posting here), but can't seem to find an elegant solution.
# Data with original index cases
set.seed(03151813)
have <- data.frame(clust_size=sample(1:50,20,replace=TRUE),index=rep(1,times=20))
have <- have[order(have$clust_size),]
# Updated data only contains clusters that need updating of index
updated <- data.frame(clust_size=c(30,34,42,44,44,46),
index=c(2,6,4,8,9,4))
# Desired dataset
want <- data.frame(clust_size=have$clust_size,
index=c(rep(1,times=9),2,1,6,
1,1,1,4,1,8,9,4))
Here is a base R approach. Add row numbers to have and updated for each clust_size. So the clust_size of 34 will have rows numbered consecutively 1, 2, and 3.
Then, you can merge the two together on both clust_size and row number. If you include all.x you will get all rows from the first data frame have.
Final step is to replace the missing NA values in your new index column with the original index.
have$rn <- with(have, ave(seq_along(clust_size), clust_size, FUN = seq_along))
updated$rn <- with(updated, ave(seq_along(clust_size), clust_size, FUN = seq_along))
want <- merge(have, updated, all.x = TRUE, by = c("clust_size", "rn"))
want$index.y <- ifelse(is.na(want$index.y), want$index.x, want$index.y)
want[, c("clust_size", "index.y")]
An alternate version using dplyr would be something like this:
library(dplyr)
have2 <- have %>%
group_by(clust_size) %>%
mutate(rn = row_number())
updated2 <- updated %>%
group_by(clust_size) %>%
mutate(rn = row_number())
left_join(have2, updated2, by = c("clust_size", "rn")) %>%
mutate(index.y = coalesce(index.y, index.x))
Output
clust_size index.y
1 1 1
2 5 1
3 8 1
4 10 1
5 16 1
6 20 1
7 22 1
8 27 1
9 29 1
10 30 2
11 30 1
12 34 6
13 34 1
14 34 1
15 35 1
16 42 4
17 43 1
18 44 8
19 44 9
20 46 4
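If you want a single index column without the helper columns, one possible final step for the dplyr version (names here are just illustrative) is:
want2 <- left_join(have2, updated2, by = c("clust_size", "rn")) %>%
  mutate(index = coalesce(index.y, index.x)) %>%
  ungroup() %>%
  select(clust_size, index)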

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between a start/stop pair I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match I'd like a 1, otherwise a 0.
Neither ID nor SNR is important; I'm not looking to match on ID. Instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 1
2 32241.14 0
3 68138.72 1
4 111233.93 0
5 128395.28 0
6 146180.31 0
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of the start or end times would also do the trick (also with 1/0 outputs).
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
  #get boolean vector to show if any of the 'truth' rows are a 'match'
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  #if any are true then update the match and between vectors
  if(any(match.vec)){
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
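If you also want the 2-second tolerance mentioned in the question, one small change inside the loop should do it (still assuming start/stop sit in columns 2 and 3 of truth): widen each interval before the %between% check, along these lines:
match.vec <- apply(truth[, 2:3],
                   MARGIN = 1,
                   FUN = function(x) {possible$Times[i] %between% c(x[1] - 2, x[2] + 2)})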

r - lapply divides a column by an integer value from different dataset, unexpected result

I have two data.frames, one with genotype counts and one with the numbers I need to normalize the counts in the first dataset by.
countsdata=data.frame(genotype1=rep(c(10,20,30,40),each=1),
genotype2=rep(c(100,200,300,400),each=1),
genotype3=rep(c(40,50,60,70),each=1),
genotype4=rep(c(40,50,60,70),each=1)
)
coldata = data.frame(Group =c('genotype1', 'genotype2', 'genotype3', 'genotype4'),
Treatment = rep(c("control","treated"),each = 2),
Norm=rep(c(1,2,5,5)))
I made sure my variables are not factors:
factorsCharacter <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
as.character))
coldata=factorsCharacter(coldata)
Then I see that lapply loops through my counts, one column at a time, and through my coldata that contains the normalization value (Norm). All looks good, until I combine the two actions in the same step:
> lapply(coldata['Group'],function(group_i){group_i})
$Group
[1] "genotype1" "genotype2" "genotype3" "genotype4"
> lapply(coldata['Group'],function(group_i){countsdata[,group_i]})
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 20 200 50 50
3 30 300 60 60
4 40 400 70 70
> lapply(coldata['Group'],function(group_i){as.integer(coldata[coldata$Group==group_i,'Norm'])})
$Group
[1] 1 2 5 5
> lapply(coldata['Group'],function(group_i){
+ countsdata[,group_i]/as.integer(coldata[coldata$Group==group_i,'Norm'])
+ })
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 10 100 25 25
3 6 60 12 12
4 8 80 14 14
Here the result is not what I was expecting (each column divided by its normalization number). After further inspection I noticed it's normalizing by rows, in other words across different columns, which shouldn't be the case as I am looping through one column at a time. I am probably missing a basic concept, but looking through other SO posts I didn't find anything I could use. My goal is to fix the code to make the right calculation, but I would also like to understand why the code above is not working. Thanks so much.
The problem is in using [ and not [[. So, instead of looping through each of the elements in the 'Group' column, we loop over a list of length 1 holding all the elements. So, use coldata[, 'Group'], coldata[['Group']] or coldata$Group for looping.
countsdataNew <- countsdata
countsdataNew[] <- lapply(coldata[['Group']],function(group_i)
countsdata[,group_i]/coldata$Norm[coldata$Group==group_i])
countsdataNew
# genotype1 genotype2 genotype3 genotype4
#1 10 50 8 8
#2 20 100 10 10
#3 30 150 12 12
#4 40 200 14 14
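To see why the original call misbehaved: coldata['Group'] is a one-column data frame, so lapply iterates exactly once and group_i is the whole vector of genotype names; countsdata[, group_i] is then the full table, and dividing it by the length-4 Norm vector recycles that vector down the rows, which is what produced the apparent row-wise normalization. A quick check:
length(coldata['Group'])    # 1 -- a list holding one column, so lapply runs a single time
length(coldata[['Group']])  # 4 -- the character vector itself, so lapply runs once per genotype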
If the column names in 'countsdata' and the 'Group' column from 'coldata' are in the same order, we can do this easily with Map
Map(`/`, countsdata, coldata$Norm)
Or just replicate 'Norm' across the columns (indexing it by each cell's column number) and do a simple division
countsdata/coldata$Norm[col(countsdata)]
Or with sweep
sweep(countsdata, 2, coldata$Norm, "/")

Sort list on numeric values stored as factor

I have 4 data frames with data from different experiments, where each row represents a trial. The participant's id (SID) is stored as a factor. Each one of the data frames look like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the id's of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant id's (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
order(unique(df2$SID)),
order(unique(df3$SID)),
order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic, I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x<-cbind(sort(as.numeric(unique(df1$SID)),decreasing = F),
sort(as.numeric(unique(df2$SID)),decreasing = F),
sort(as.numeric(unique(df3$SID)),decreasing = F),
sort(as.numeric(unique(df4$SID)),decreasing = F) )
Still does not work... I get:
V1 V2 V3 V4
1 8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject id's are 3 to 5 digit numbers...
If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), df$trial),] # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
Suppose x and y represent the Exp1 SID and Exp2 SID. You can create an ordered list of unique values as shown below:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = F),y=sort(x = unique(y),decreasing = F))
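Note that the attempt in the edit fails because as.numeric() applied directly to a factor returns the internal level codes, not the original numbers; convert to character first. Assuming every experiment has the same number of unique participants, the requested data frame could then be built along these lines with the four data frames from the question:
data.frame(Exp1 = sort(as.numeric(as.character(unique(df1$SID)))),
           Exp2 = sort(as.numeric(as.character(unique(df2$SID)))),
           Exp3 = sort(as.numeric(as.character(unique(df3$SID)))),
           Exp4 = sort(as.numeric(as.character(unique(df4$SID)))))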

Subset by first and last value per group

I have a data frame in R with two columns, temp and timeStamp. The data has temp values at regular intervals. A portion of the data frame looks like this:
I have to create a line chart showing changes in temp over time. As can be seen, temp values remain the same for several timeStamps. Having these repeating values increases the size of the data file and I want to remove them. So the output should look like this:
Showing just the values where there is a change.
I cannot think of a way to get this done in R. Any inputs in the right direction would be really helpful.
Here's a dplyr solution:
# Toy data
df <- data.frame(time = seq(20), temp = c(rep(60, 5), rep(61, 7), rep(59, 3), rep(60, 5)))
# Now filter for the first and last rows and ones bracketing a temperature change
df %>% filter(temp!=lag(temp) | temp!=lead(temp) | time==min(time) | time==max(time))
time temp
1 1 60
2 5 60
3 6 61
4 12 61
5 13 59
6 15 59
7 16 60
8 20 60
If the data are grouped by a third column (id), just add group_by(id) %>% before the filtering step.
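For example, with a hypothetical id column:
df %>%
  group_by(id) %>%
  filter(temp != lag(temp) | temp != lead(temp) | time == min(time) | time == max(time))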
One option would be using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'temp', we subset the first and last observation (.SD[c(1L, .N)]) of each group. If a group has only a single row, we take it as is (else .SD).
library(data.table)
setDT(df1)[, if(.N>1) .SD[c(1L, .N)] else .SD, by =temp]
# temp val
#1: 22.50 1
#2: 22.50 4
#3: 22.37 5
#4: 22.42 6
#5: 22.42 7
Or a base R option with duplicated. We check the duplicated values in 'temp' (output is a logical vector), and also check the duplication from the reverse side (fromLast=TRUE). Use & to find the elements that are TRUE in both cases, negate (!) and subset the rows of 'df1'.
df1[!(duplicated(df1$temp) & duplicated(df1$temp,fromLast=TRUE)),]
# temp val
#1 22.50 1
#4 22.50 4
#5 22.37 5
#6 22.42 6
#7 22.42 7
data
df1 <- data.frame(temp=c(22.5, 22.5, 22.5, 22.5, 22.37,22.42, 22.42), val=1:7)

Resources