Limit frequency table to one entry per unique ID in R

I have a data frame that tracks error codes generated from a script. Every time the script is executed, it adds records to a massive CSV file. The event_id field is unique to each run of the script. Each run may add multiple combinations of CRITICAL, ERROR, WARNING, DIAGNOSTIC, or INFORMATION messages with accompanying values and additional information (not represented here for simplicity).
I need to summarize the number of each class of error in the CSV file, but multiple errors from the same event id should only count as one error. Here's an example of how the data is structured:
event_id | class | value
1 | ERROR | 5409
1 | ERROR | 5410
2 | WARNING | 212
3 | ERROR | 5409
3 | WARNING | 400
3 | DIAGNOSTIC | 64
And this is what I'm looking to get as output. Even though there were three ERROR class events, two of them were associated with the same event, so it only counts as one.
class | count
ERROR | 2
WARNING | 2
DIAGNOSTIC | 1
I did try searching for this, but don't even know what keywords to search for. So even if you aren't able to answer the question, I'd appreciate any help with search queries.

library(data.table)  # needed for as.data.table() and the data.table syntax below

df = read.table(header = TRUE, sep = "|", strip.white = TRUE, text = "
event_id | class | value
1 | ERROR | 5409
1 | ERROR | 5410
2 | WARNING | 212
3 | ERROR | 5409
3 | WARNING | 400
3 | DIAGNOSTIC | 64")
df = as.data.table(df)
setkey(df, event_id, class)
# Deduplicate on (event_id, class) explicitly: in current data.table versions,
# unique() compares all columns by default, so 'value' must be excluded
unique(df, by = c("event_id", "class"))[, .N, by = class]
#         class N
# 1:      ERROR 2
# 2:    WARNING 2
# 3: DIAGNOSTIC 1

You could split event_id by class, count the unique ids within each group, and then build a data frame (here `df` is the data frame from the question):
> s <- sapply(split(df$event_id, df$class), function(x) length(unique(x)))
> data.frame(count = s)
##            count
## DIAGNOSTIC     1
## ERROR          2
## WARNING        2

You could build a 2-d contingency table from the class and event_id variables, use pmin to cap each count at 1 in that table, and then use rowSums to collapse it back to a 1-d table:
rowSums(pmin(table(df$class, df$event_id), 1))
# DIAGNOSTIC      ERROR    WARNING
#          1          2          2
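All three approaches boil down to counting distinct (event_id, class) pairs. A minimal base-R sketch of the same idea, building the sample data directly rather than parsing it from text:

```r
# Sample data from the question
df <- data.frame(
  event_id = c(1, 1, 2, 3, 3, 3),
  class    = c("ERROR", "ERROR", "WARNING", "ERROR", "WARNING", "DIAGNOSTIC")
)

# Drop rows that repeat the same (event_id, class) pair, then tabulate class
u <- unique(df[c("event_id", "class")])
counts <- table(u$class)
counts
# DIAGNOSTIC      ERROR    WARNING
#          1          2          2
```

This is also the vocabulary to search for: "count distinct pairs" or "count unique combinations of two columns".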

Related

Selecting row by min value and merging

This is a row selection problem in R.
I want to get only one row from each dataset, based on the minimum value of the variable fifteen. I have tried the two approaches below; neither returns the desired output.
For the first list element, SPX1[[1]], the data.frame is set up like this:
SPX1[[1]]
+---+---------+------------+---------+
|   | stkPx   | expirDate  | fifteen |
+---+---------+------------+---------+
| 1 | 1461.62 | 2013-01-19 |       2 |
| 2 | 1461.25 | 2013-01-25 |       8 |
| 3 | 1461.35 | 2013-02-01 |       3 |
| . | .       | .          |       . |
+---+---------+------------+---------+
The first approach is aggregating and then merging. As this has to be done for each element of a list, the code is in a loop:
df.agg<- list() # creates a list
for (l in 1:length(SPX1)){
df.agg[[l]]<- SPX1[[l]] %>%
aggregate(fifteen ~ ticker, data=SPX1[[l]], min) #Finding minimum value of fifteen for ticker
df.minSPX1 <- merge(df.agg[[l]], SPX1[[l]]) #Merge dataset so we only get row with min fifteen value
}
I get :
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'SPX1' of mode 'function' was not found
Another approach; however, it just changes all the values of the first column to one instead of deleting any rows when merging:
TESTER<- which.min(SPX1[[1]]$fifteen) # Finds which row has minimum value of fifteen
df.minSPX1 <- merge(TESTER, SPX1[[1]],by.x=by) #Try to merge so I only get the row with min. fifteen
I have tried reading other answers on SO but maybe because of the way the lists are set up this won't work?
I hope you can tell me where I get this wrong.
Try this:
df <- lapply(SPX1, function(x) x[x$fifteen == min(x$fifteen), ])
df <- as.data.frame(df)
Edit:
As suggested by @Gregor-Thomas, this version also works when there is a tie (which.min() returns only the first minimum):
df <- lapply(SPX1, function(x) x[which.min(x$fifteen), ])
df <- as.data.frame(df)
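Note that as.data.frame() on a list of one-row data frames binds them side by side as columns. If the goal is instead one row per list element stacked into a single data frame, a sketch using do.call(rbind, ...) on made-up data (SPX1 here is a stand-in for the asker's list, whose real contents aren't shown):

```r
# Stand-in for the asker's list of data frames
SPX1 <- list(
  data.frame(stkPx = c(1461.62, 1461.25, 1461.35),
             expirDate = c("2013-01-19", "2013-01-25", "2013-02-01"),
             fifteen = c(2, 8, 3)),
  data.frame(stkPx = c(1400.10, 1399.80),
             expirDate = c("2013-01-19", "2013-01-25"),
             fifteen = c(5, 1))
)

# Keep the first row attaining the minimum of 'fifteen' in each element,
# then stack the one-row results into a single data frame
mins <- lapply(SPX1, function(x) x[which.min(x$fifteen), ])
df.min <- do.call(rbind, mins)
df.min$fifteen
# [1] 2 1
```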

I am trying to create a new column that is conditional upon the contents of existing columns

I am trying to make several new columns in my data frame that are conditional upon the contents of a few existing columns.
In pseudo-code, the argument basically goes: "If VariableA is 1 and either VariableB is 2 or VariableC is 3, then make VariableD = 1; if it does not meet these conditions, make it a zero."
I have tried using for loops and ifelse statements, but have had no luck. I know that the logic of my commands is correct, but I am making some error translating it into R syntax, which isn't surprising because I just started using R about a week ago.
Below is a simplified version of what I have tried doing...
Data$VariableD <- ifelse(Data$VariableA == 'Jim' && (Data$VariableB == 2 || Data$VariableC == 3), 1, 0)
It runs without error, but upon examining the contents of VariableD, all cells are filled with "NA"
Here is an example using a similar dataset; notice that row 1 meets the criteria. (I can't make a proper table to save my life, but I think it's interpretable.)
|   | VariableA | VariableB | VariableC | VariableD |
| 1 | Jim       | 2         | 4         | NA        |
| 2 | Tom       | 2         | 3         | NA        |
| 3 | Tom       | 3         | 4         | NA        |
Could you please provide the class of your variables? It could be a class problem (for example, you wrote == 1 for VariableA, but if VariableA is character it should be == "1").
Could you please also provide your full loop?
Otherwise, please try this (note the condition on VariableC, per your pseudo-code, and the vectorized & and | instead of && and ||):
Data$VariableD <- ifelse(Data$VariableA == 1 & (Data$VariableB == 2 | Data$VariableC == 3), 1, 0)
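A runnable sketch of that fix against the question's sample rows (using the character value "Jim", since that is what the example table actually holds). The single & and | operate element-wise across the whole column, whereas && and || collapse to a single value, which is why the original attempt misbehaved:

```r
Data <- data.frame(
  VariableA = c("Jim", "Tom", "Tom"),
  VariableB = c(2, 2, 3),
  VariableC = c(4, 3, 4)
)

# Vectorized & and | evaluate the condition for every row
Data$VariableD <- ifelse(Data$VariableA == "Jim" &
                         (Data$VariableB == 2 | Data$VariableC == 3), 1, 0)
Data$VariableD
# [1] 1 0 0
```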

Getting a dataframe of logical values from a vector of statements

I have a number of lists of conditions and I would like to evaluate their combinations, and then get binary values for these logical values (TRUE = 1, FALSE = 0). The conditions themselves may change or grow as my project progresses, so I'd like to have one place within the script where I can alter these conditional statements while the rest of the script stays the same.
Here is a simplified, reproducible example:
# get the data
df <- data.frame(id = c(1,2,3,4,5), x = c(11,4,8,9,12), y = c(0.5,0.9,0.11,0.6, 0.5))
# name and define the conditions
names1 <- c("above2","above5")
conditions1 <- c("df$x > 2", "df$x >5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("df$y < 0.6", "df$y < 0.4")
# create an object that contains the unique combinations of these conditions and their names, to be used for labeling columns later
names_combinations <- as.vector(t(outer(names1, names2, paste, sep="_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep=" & ")))
# create a dataframe of the logical values of these conditions
condition_combinations_logical <- ????? # This is where I need help
# convert the logical columns to binary (0/1) values
df[paste0("var_", names_combinations)] <- +(condition_combinations_logical)
to get output that could look something like:
| id | x  | y    | var_above2_belowpt6 | var_above2_belowpt4 | etc. |
|----|----|------|---------------------|---------------------|------|
| 1  | 11 | 0.5  | 1                   | 0                   |      |
| 2  | 4  | 0.9  | 0                   | 0                   |      |
| 3  | 8  | 0.11 | 1                   | 1                   |      |
etc. ...
Looks like the dreaded eval(parse()) does it (hard to think of a much easier way ...). Then use storage.mode()<- to convert from logical to integer ...
res <- sapply(condition_combinations,function(x) eval(parse(text=x)))
storage.mode(res) <- "integer"
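Putting the whole thing together as a runnable sketch, filling the ????? step with the sapply/eval(parse()) line above:

```r
df <- data.frame(id = c(1, 2, 3, 4, 5),
                 x = c(11, 4, 8, 9, 12),
                 y = c(0.5, 0.9, 0.11, 0.6, 0.5))

names1 <- c("above2", "above5")
conditions1 <- c("df$x > 2", "df$x > 5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("df$y < 0.6", "df$y < 0.4")

names_combinations <- as.vector(t(outer(names1, names2, paste, sep = "_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep = " & ")))

# Evaluate each condition string; result is a logical matrix, one column per combination
res <- sapply(condition_combinations, function(x) eval(parse(text = x)))
storage.mode(res) <- "integer"  # TRUE/FALSE -> 1/0

df[paste0("var_", names_combinations)] <- res
df$var_above2_belowpt6
# [1] 1 0 1 0 1
```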

R read in excel file with merged cells as column headers

In biological assays we often have replicate measurements from the same molecule and then plot a dose response curve using the average of 2 or 3 replicates. I would like to read into R an excel file, where the column headers of replicates have been merged - text example below, and link to example file. The read_excel function of the readxl package can read the file in but unmerges the header cells and replaces the empty cells with NAs.
conc | Sample1 | Sample2
-------------------------------------------
10 | 1.5 | 2.5 | 3 | 4
-------------------------------------------
100 | 15 | 25 | 30 | 40
-------------------------------------------
1000 | 150 | 250 | 300 | 400
Is there a way of either preserving the merged cell layout in R or alternatively reading in the columns and automatically replicating/renumbering the headers like below?
conc | Sample1.1 | Sample1.2 | Sample2.1 | Sample2.2
--------------------------------------------------------------
10 | 1.5 | 2.5 | 3 | 4
--------------------------------------------------------------
100 | 15 | 25 | 30 | 40
--------------------------------------------------------------
1000 | 150 | 250 | 300 | 400
Thanks.
Not a complete answer, but it is possible to have a list column, such that multiple values are contained within a single cell. This might serve the same function as the "merged columns" in Excel. Here's an example, just to show what I mean:
library(data.table)
new <- data.table("V1" = c(1,2), "V2" = list(c(1,2,5),c(2,3)) )
Notice that column V2 holds two vectors within a list (the vectors even have different lengths, and each can be as long or short as you need). Now you can call all the values for a given cell:
> new$V2[[1]]
[1] 1 2 5
Or a specific replication:
> new$V2[[2]][2]
[1] 3
I don't know exactly what your spreadsheet looks like, and getting it from its current form into a "list column" form may be difficult depending on that. Hopefully this gives you some ideas though!
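Alternatively, to get the Sample1.1/Sample1.2 renaming the question asks for directly, you can fill the NA headers forward and then number within each group. A base-R sketch on a hand-built header vector (standing in for what read_excel returns after unmerging the cells):

```r
# What read_excel gives back after unmerging: NA where a merged cell continued
hdr <- c("conc", "Sample1", NA, "Sample2", NA)

# Fill each NA with the last non-NA header to its left
for (i in seq_along(hdr)) if (is.na(hdr[i])) hdr[i] <- hdr[i - 1]

# Number headers within each repeated group: Sample1.1, Sample1.2, ...
pos  <- ave(seq_along(hdr), hdr, FUN = seq_along)  # position within group
size <- ave(seq_along(hdr), hdr, FUN = length)     # group size
hdr  <- ifelse(size > 1, paste(hdr, pos, sep = "."), hdr)
hdr
# [1] "conc" "Sample1.1" "Sample1.2" "Sample2.1" "Sample2.2"
```

The resulting vector can then be assigned as the column names of the data read in with `col_names = FALSE, skip = 1`.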

How to return the row from a data frame based on the maximum value of the data frame in R?

Suppose I have the data frame displayed below. Most of the suggestions I found on Stack Overflow aim at getting the max from one column and then returning the row index.
I was wondering whether there is a way to return the row index of the data frame by scanning two or more columns for the maximum.
To summarize, from the example below, I want to get the row:
11 building_footprint_sum 0.003 0.470
which holds the maximum of the data frame
+----+-------------------------+--------------------+-------------------+
| id | plot_name | rsquare_allotments | rsquare_block_dev |
+----+-------------------------+--------------------+-------------------+
| 6 | building_footprint_max | 0.002 | 0.421 |
| 7 | building_footprint_mean | 0.002 | 0.354 |
| 8 | building_footprint_med | 0.002 | 0.350 |
| 9 | building_footprint_min | 0.002 | 0.278 |
| 10 | building_footprint_sd | 0.003 | 0.052 |
| 11 | building_footprint_sum | 0.003 | 0.470 |
+----+-------------------------+--------------------+-------------------+
Is there a rather simple way to achieve this?
You are looking for the row index in which a matrix attains its maximum. You can do this by using which() with the arr.ind=TRUE option:
> set.seed(1)
> foo <- matrix(rnorm(6),3,2)
> which(foo==max(foo),arr.ind=TRUE)
row col
[1,] 1 2
So in this case, you would need row 1. (And you can discard the col output.)
If you go this route, be wary of floating point arithmetic and == (see FAQ 7.31). Better to do this:
> which(foo>max(foo)-0.01,arr.ind=TRUE)
row col
[1,] 1 2
where you use an appropriate small value in place of 0.01.
Try using pmax
?pmax
pmax and pmin take one or more vectors (or matrices) as arguments and
return a single vector giving the ‘parallel’ maxima (or minima) of the vectors.
I would suggest doing this in two steps:
# make a new column holding the larger of the two rsquare values in each row
> df$new <- pmax(df$rsquare_allotments, df$rsquare_block_dev)
# then keep the row(s) where that new variable attains its maximum
> df[df$new == max(df$new), ]
Consider that if the max value occurs more than once, your result will have more than one row
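Both ideas can be combined into a direct lookup: take the numeric columns as a matrix, find the position of the overall maximum with which.max(), and convert it back to a row index with arrayInd(). A sketch using the data from the question:

```r
df <- data.frame(
  id = 6:11,
  plot_name = c("building_footprint_max", "building_footprint_mean",
                "building_footprint_med", "building_footprint_min",
                "building_footprint_sd", "building_footprint_sum"),
  rsquare_allotments = c(0.002, 0.002, 0.002, 0.002, 0.003, 0.003),
  rsquare_block_dev  = c(0.421, 0.354, 0.350, 0.278, 0.052, 0.470)
)

m <- as.matrix(df[c("rsquare_allotments", "rsquare_block_dev")])
row_idx <- arrayInd(which.max(m), dim(m))[1]  # row of the overall maximum
df[row_idx, ]  # the building_footprint_sum row
```

Like which.min(), which.max() returns only the first position when the maximum is tied.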
