Read an Excel file with merged cells as column headers into R

In biological assays we often have replicate measurements from the same molecule and then plot a dose-response curve using the average of 2 or 3 replicates. I would like to read into R an Excel file where the column headers of the replicates have been merged (text example below, and link to example file). The read_excel function of the readxl package can read the file in, but it unmerges the header cells and replaces the now-empty cells with NAs.
conc |   Sample1   |   Sample2
-----+------+------+------+------
10   |  1.5 |  2.5 |    3 |    4
100  |   15 |   25 |   30 |   40
1000 |  150 |  250 |  300 |  400
Is there a way of either preserving the merged cell layout in R or alternatively reading in the columns and automatically replicating/renumbering the headers like below?
conc | Sample1.1 | Sample1.2 | Sample2.1 | Sample2.2
-----+-----------+-----------+-----------+----------
10   |       1.5 |       2.5 |         3 |         4
100  |        15 |        25 |        30 |        40
1000 |       150 |       250 |       300 |       400
Thanks.

Not a complete answer, but it is possible to have a list column, such that multiple values are contained within a single cell. This might serve the same function as the "merged columns" in Excel. Here's an example, just to show what I mean:
library(data.table)
new <- data.table("V1" = c(1,2), "V2" = list(c(1,2,5),c(2,3)) )
Notice that column V2 holds two vectors in a list (each vector can be a different length, as long or short as you need). Now you can call all the values for a given cell:
> new$V2[[1]]
[1] 1 2 5
Or a specific replication:
> new$V2[[2]][2]
[1] 3
I don't know exactly what your spreadsheet looks like, and getting it from its current form into a "list column" form may be difficult depending on that. Hopefully this gives you some ideas though!
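For the record, you can get exactly the Sample1.1 / Sample1.2 / ... layout by repairing the header yourself after readxl unmerges it. A sketch in base R; the readxl calls are commented out and use a hypothetical file name ("assay.xlsx"), so adjust them to your workbook:

```r
# Header row as readxl hands it back after unmerging: each merged label
# appears once, followed by NA for the extra cells the merge spanned.
# In practice you would get this from your file (name is hypothetical) via:
#   hdr <- unname(unlist(read_excel("assay.xlsx", n_max = 1, col_names = FALSE)[1, ]))
hdr <- c("conc", "Sample1", NA, "Sample2", NA)

# Carry each label forward over the NAs left by unmerging
for (i in seq_along(hdr)) {
  if (is.na(hdr[i])) hdr[i] <- hdr[i - 1]
}

# Number the repeated labels: Sample1, Sample1 -> Sample1.1, Sample1.2
idx <- ave(seq_along(hdr), hdr, FUN = seq_along)
dup <- duplicated(hdr) | duplicated(hdr, fromLast = TRUE)
hdr[dup] <- paste(hdr[dup], idx[dup], sep = ".")

hdr
# "conc" "Sample1.1" "Sample1.2" "Sample2.1" "Sample2.2"

# Then re-read the data, skipping the merged header row:
#   dat <- read_excel("assay.xlsx", skip = 1, col_names = hdr)
```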

Related

R introduces random x in variables when reading the .csv file. How can I get rid of those?

I'm trying to plot binary data in R, and it seems that when R reads my input file it introduces an "X" before every variable name.
My data frame is a longer version of this saved in .csv
| Species    | 0.5 | 2.5 | 4.5 | 5.5 | 7.5 | ...
| Black rat  |     |  1  |  0  |  0  |  0  | ...
| Norway rat |  0  |  0  |  0  |  0  |  0  | ...
| Caribou    |  0  |  0  |  0  |  0  |  0  | ...
And once I import it in R with
data <-read.csv("DNA_binaries_flipped.csv")
what I get from
head(data) #Check that data looks correct
is
Species X0.5 X2.5 X4.5
1 Black rat 0 1 0
2 Norway rat 0 0 0
3 Caribou 0 0 0
I later plot the data, and obviously the "X" is still there. How can I get rid of it?
Here's a photo of the complete plot
I tried to restart R and create a new input file
R does not like to use numbers for column names in data frames, or for vectors in general.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/make.names.html
A syntactically valid name consists of letters, numbers and
the dot or underline characters and starts with a letter or the dot
not followed by a number. Names such as ".2way" are not valid, and
neither are the reserved words.
The definition of a letter depends on the current locale, but only
ASCII digits are considered to be digits.
The character "X" is prepended if necessary. All invalid characters
are translated to ".". A missing value is translated to "NA". Names
which match R keywords have a dot appended to them. Duplicated values
are altered by make.unique.
It is possible to create a data frame with names that are not syntactically valid, but this can lead to unexpected behaviors so is generally avoided.
I'd suggest removing the X from your variable names once you have converted them to variable values and prior to plotting, e.g.
library(tidyverse)
data.frame(row = 1:3, X1 = 1:3, X2.3 = 2:4) %>%
  pivot_longer(-row) %>%
  mutate(name = str_sub(name, start = 2)) %>%
  ggplot(aes(row, name, fill = as.character(value))) +
  geom_tile()
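Alternatively, you can stop read.csv from mangling the names in the first place with check.names = FALSE; the trade-off is that non-syntactic names like 0.5 must then be quoted with backticks. A self-contained sketch, using a temp file to stand in for your DNA_binaries_flipped.csv:

```r
# Stand-in for the question's CSV, written to a temp file so the example
# is self-contained; substitute your own "DNA_binaries_flipped.csv"
tmp <- tempfile(fileext = ".csv")
writeLines(c("Species,0.5,2.5,4.5",
             "Black rat,0,1,0",
             "Norway rat,0,0,0"), tmp)

# check.names = FALSE skips make.names(), so no "X" is prepended
data <- read.csv(tmp, check.names = FALSE)
names(data)
# "Species" "0.5" "2.5" "4.5"

# Non-syntactic names have to be quoted with backticks:
data$`0.5`
```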

Selecting row by min value and merging

This is a row selection problem in R.
I want to get only one row in each dataset based on the minimum value of the variable fifteen. I have tried two approaches below, none of them returning the desired output.
For the first list element, SPX1[[1]], the data.frame is set up like this:
SPX1[[1]]
+----------------------------------+---------+------------------------+----------------+
| | stkPx | expirDate | fifteen |
+----------------------------------+---------+------------------------+----------------+
| 1 | 1461.62 | 2013-01-19 | 2 |
| 2 | 1461.25 | 2013-01-25 | 8 |
| 3 | 1461.35 | 2013-02-01 | 3 |
| . | . | . | . |
| . | . | . | . |
+----------------------------------+---------+------------------------+----------------+
The first approach is aggregating and the merge. As this has to be done for a list the code is in a loop:
df.agg <- list() # creates a list
for (l in 1:length(SPX1)) {
  df.agg[[l]] <- SPX1[[l]] %>%
    aggregate(fifteen ~ ticker, data = SPX1[[l]], min) # find minimum value of fifteen per ticker
  df.minSPX1 <- merge(df.agg[[l]], SPX1[[l]]) # merge so we only get the row with the min fifteen value
}
I get:
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'SPX1' of mode 'function' was not found
Another approach; however, it just changes all the values of the first column to one instead of deleting any rows when merging:
TESTER<- which.min(SPX1[[1]]$fifteen) # Finds which row has minimum value of fifteen
df.minSPX1 <- merge(TESTER, SPX1[[1]],by.x=by) #Try to merge so I only get the row with min. fifteen
I have tried reading other answers on SO but maybe because of the way the lists are set up this won't work?
I hope you can tell me where I went wrong.
Try this:
df<- lapply(SPX, function(x) x[x$fifteen==min(x$fifteen),])
df<- as.data.frame(df)
Edit:
As suggested by @Gregor-Thomas, this version returns a single row even when there is a tie:
df<- lapply(SPX, function(x) x[which.min(x$fifteen), ])
df<- as.data.frame(df)
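Here's the which.min approach as a self-contained sketch (made-up data modeled on the question). One caveat: as.data.frame() on a list of data frames binds them side by side, column-wise; if you want the per-element rows stacked into one data frame, do.call(rbind, ...) is probably what you're after:

```r
# Made-up stand-in for the question's list of data frames
SPX1 <- list(
  data.frame(stkPx = c(1461.62, 1461.25, 1461.35),
             expirDate = c("2013-01-19", "2013-01-25", "2013-02-01"),
             fifteen = c(2, 8, 3)),
  data.frame(stkPx = c(1500.10, 1499.80),
             expirDate = c("2013-02-08", "2013-02-15"),
             fifteen = c(5, 1))
)

# One row per list element: the row where `fifteen` is smallest
mins <- lapply(SPX1, function(x) x[which.min(x$fifteen), ])

# Stack the per-element rows into a single data frame
df.min <- do.call(rbind, mins)
df.min$fifteen
# 2 1
```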

I am trying to create a new column that is conditional upon the contents of existing columns

I am trying to make several new columns in my data frame that are conditional upon the contents of a few existing columns.
In pseudo-code, the arguments basically go: "If VariableA is 1 and either VariableB is 2 or VariableC is 3, then make VariableD = 1; if it does not meet these conditions, make it zero."
I have tried using for loops and ifelse statements, but have had no luck. I know that the logic of my commands is correct, but I am making some error translating it into R syntax, which isn't surprising because I just started using R about a week ago.
Below is a simplified version of what I have tried doing...
Data$VariableD <- ifelse(Data$VariableA == 'Jim' && (Data$VariableB == 2 || Data$VariableC == 3), 1, 0)
It runs without error, but upon examining the contents of VariableD, every cell is filled with NA.
Here is an example using a similar dataset; notice that row 1 meets the criteria. (I can't make a proper table to save my life, but I think it's interpretable.)
|   | Variable A | Variable B | Variable C | Variable D |
| 1 | Jim        | 2          | 4          | NA         |
| 2 | Tom        | 2          | 3          | NA         |
| 3 | Tom        | 3          | 4          | NA         |
Could you please provide the class of your variables? It could be a problem of class (for example, you put == 1 for VariableA, but if VariableA is of class character it should be == "1").
Could you please also provide your full loop?
Otherwise, please try that:
Data$VariableD <- ifelse(Data$VariableA == 1 & (Data$VariableB == 2 | Data$VariableC == 3), 1, 0)
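One likely culprit is &&/|| versus &/|: the double forms look only at the first element of each vector, so the test isn't applied row by row. A minimal, self-contained version of the example table with the vectorized operators:

```r
# Self-contained version of the example table from the question
Data <- data.frame(VariableA = c("Jim", "Tom", "Tom"),
                   VariableB = c(2, 2, 3),
                   VariableC = c(4, 3, 4))

# Single & and | test every row; && and || would only test the first element
Data$VariableD <- ifelse(Data$VariableA == "Jim" &
                           (Data$VariableB == 2 | Data$VariableC == 3),
                         1, 0)
Data$VariableD
# 1 0 0  (only row 1 meets the criteria)
```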

How to return the row from a data frame based on the maximum value of the data frame in R?

Suppose I have the data frame displayed below. Most of the suggestions I found on Stack Overflow aim at getting the max from one column and then returning the row index.
I was wondering whether there is a way to return the row index of the data frame by scanning two or more columns for the maximum.
To summarize, from the example below, I want to get the row:
11 building_footprint_sum 0.003 0.470
which holds the maximum of the data frame
+----+-------------------------+--------------------+-------------------+
| id | plot_name | rsquare_allotments | rsquare_block_dev |
+----+-------------------------+--------------------+-------------------+
| 6 | building_footprint_max | 0.002 | 0.421 |
| 7 | building_footprint_mean | 0.002 | 0.354 |
| 8 | building_footprint_med | 0.002 | 0.350 |
| 9 | building_footprint_min | 0.002 | 0.278 |
| 10 | building_footprint_sd | 0.003 | 0.052 |
| 11 | building_footprint_sum | 0.003 | 0.470 |
+----+-------------------------+--------------------+-------------------+
Is there a rather simple way to achieve this?
You are looking for the row index in which a matrix attains its maximum. You can do this by using which() with the arr.ind=TRUE option:
> set.seed(1)
> foo <- matrix(rnorm(6),3,2)
> which(foo==max(foo),arr.ind=TRUE)
row col
[1,] 1 2
So in this case, you would need row 1. (And you can discard the col output.)
If you go this route, be wary of floating point arithmetic and == (see FAQ 7.31). Better to do this:
> which(foo>max(foo)-0.01,arr.ind=TRUE)
row col
[1,] 1 2
where you use an appropriate small value in place of 0.01.
Try using pmax
?pmax
pmax and pmin take one or more vectors (or matrices) as arguments and
return a single vector giving the ‘parallel’ maxima (or minima) of the vectors.
I would suggest doing this in two steps:
# make a new column that compares column 3 and column 4 and returns the larger value
> df$new <- pmax(df$rsquare_allotments, df$rsquare_block_dev)
# look for the row, where the new variable has the largest value
> df[(df$new == max(df$new)), ][3:4]
Consider that if the max value occurs more than once, your result will have more than one row.
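Combining the two ideas into one runnable sketch (a made-up two-row subset of the table): pmax gives the row-wise maximum of the two rsquare columns, and which.max picks the row index without the floating-point == comparison the first answer warns about.

```r
# Made-up two-row subset of the table in the question
df <- data.frame(plot_name = c("building_footprint_max", "building_footprint_sum"),
                 rsquare_allotments = c(0.002, 0.003),
                 rsquare_block_dev = c(0.421, 0.470))

# Row-wise maximum across the two rsquare columns ...
row_max <- pmax(df$rsquare_allotments, df$rsquare_block_dev)

# ... then the index of the overall maximum; no == comparison needed
df[which.max(row_max), ]
```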

Limit frequency table to one entry per unique ID

I have a data frame that tracks error codes generated from a script. Every time the script is executed, it adds records to a massive CSV file. The event_id field is unique to each run of the script. Each run may add multiple combinations of CRITICAL, ERROR, WARNING, DIAGNOSTIC, or INFORMATION messages with accompanying values and additional information (not represented here for simplicity).
I need to summarize the number of each class of error in the CSV file, but multiple errors from the same event id should only count as one error. Here's an example of how the data is structured:
event_id | class | value
1 | ERROR | 5409
1 | ERROR | 5410
2 | WARNING | 212
3 | ERROR | 5409
3 | WARNING | 400
3 | DIAGNOSTIC | 64
And this is what I'm looking to get as output. Even though there were three ERROR class events, two of them were associated with the same event, so it only counts as one.
class | count
ERROR | 2
WARNING | 2
DIAGNOSTIC | 1
I did try searching for this, but don't even know what keywords to search for. So even if you aren't able to answer the question, I'd appreciate any help with search queries.
library(data.table)

df = read.table(header = TRUE, sep = "|", strip.white = TRUE, text = "
event_id | class | value
1 | ERROR | 5409
1 | ERROR | 5410
2 | WARNING | 212
3 | ERROR | 5409
3 | WARNING | 400
3 | DIAGNOSTIC | 64")
df = as.data.table(df)
setkey(df, event_id, class)
unique(df, by = key(df))[, .N, by = class]
# class N
#1: ERROR 2
#2: WARNING 2
#3: DIAGNOSTIC 1
You could split class by event id then create a data frame.
> s <- sapply(split(dat$event_id, dat$class), function(x) length(unique(x)))
> data.frame(count = s)
## count
## DIAGNOSTIC 1
## ERROR 2
## WARNING 2
You could build a 2-d table using the class and event_id variables, use pmin to limit values to 1 in that table, and then use rowSums to get it back to a 1-d table:
rowSums(pmin(table(dat$class, dat$event_id), 1))
# DIAGNOSTIC ERROR WARNING
# 1 2 2
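The same dedup-then-count idea works in plain base R as well, for reference (recreating the sample data inline):

```r
# Recreate the sample data
df <- data.frame(event_id = c(1, 1, 2, 3, 3, 3),
                 class = c("ERROR", "ERROR", "WARNING",
                           "ERROR", "WARNING", "DIAGNOSTIC"))

# Drop duplicate (event_id, class) pairs so each class counts
# at most once per event, then tabulate
dedup <- unique(df[c("event_id", "class")])
table(dedup$class)
# DIAGNOSTIC      ERROR    WARNING
#          1          2          2
```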
