R: How to get highest values colwise and their respective rownames?

I want to get the highest values (let's say the highest 3) of all columns of my df. It is important for me to also get the row names of these values. Here is a subset of my data:
structure(list(BLUE.fruits = c(12803543, 3745797, 19947613, 0, 130, 4),
BLUE.nuts = c(21563867, 533665, 171984, 0, 0, 0),
BLUE.veggies = c(92690, 188940, 34910, 0, 0, 577),
GREEN.fruits = c(3389314, 15773576, 8942278, 0, 814, 87538),
GREEN.nuts = c(6399474, 1640804, 464688, 0, 0, 0),
GREEN.veggies = c(15508, 174504, 149581, 0, 0, 6190),
GREY.fruits = c(293869, 0, 188368, 0, 8, 0),
GREY.nuts = c(852646, 144024, 26592, 0, 0, 0),
GREY.veggies = c(2992, 41267, 6172, 0, 0, 0)),
.Names = c("BLUE.fruits", "BLUE.nuts", "BLUE.veggies",
"GREEN.fruits", "GREEN.nuts", "GREEN.veggies", "GREY.fruits",
"GREY.nuts", "GREY.veggies"), row.names = c("Afghanistan", "Albania",
"Algeria", "American Samoa", "Angola", "Antigua and Barbuda"),
class = "data.frame")
I tried this so far for the first column:
as.data.frame(x[,1][order(x[,1], decreasing=TRUE)][1:10])
However, I don't get the original row names, and I need an approach such as apply/lapply to go through all columns (~150 cols). Ideas? Thanks

This could help:
Print one column of data frame with row names
So if you adapt your code a bit you get a long, ugly one-liner =) that returns a list - which, judging by your "lapply" tag, is your desired output format:
lapply(1:dim(df)[2], function(col.number) df[order(df[, col.number], decreasing=TRUE)[1:3], col.number, drop = FALSE])
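Applied to the sample data above (assigned to df, as in the one-liner; assigning the result to res is just for illustration), the first list element keeps the row names:
res <- lapply(seq_along(df), function(col.number)
  df[order(df[, col.number], decreasing = TRUE)[1:3], col.number, drop = FALSE])
res[[1]]
#             BLUE.fruits
# Algeria        19947613
# Afghanistan    12803543
# Albania         3745797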

You could write a column maximum function, colMax.
colMax <- function(data) sapply(data, max, na.rm = TRUE)
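In the same spirit, here is a hedged sketch of a small helper (colTopN is a hypothetical name, not from any package) that returns the top 3 values of every column together with their row names:
# Sketch: top n values per column, keeping the row names of the data frame
colTopN <- function(data, n = 3) {
  lapply(data, function(col) {
    v <- setNames(col, rownames(data))               # attach row names
    sort(v, decreasing = TRUE)[seq_len(min(n, length(v)))]
  })
}
colTopN(x)   # x is the data frame from the question; returns a named list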

Related

How can I condense a long list of items into categories for a repeated logit regression?

I'm using a program called Apollo to make an ordered logit model. In this model, you have to specify a list of variables like this:
apollo_beta = c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0,
b_var2_dum1 = 0,
b_var2_dum2 = 0,
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
I want to do two things:
Firstly, I want to be able to specify these beforehand:
specification1 = c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0,
b_var2_dum1 = 0,
b_var2_dum2 = 0,
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
And then be able to call it:
apollo_beta = specification1
Secondly, I want to be able to make categories:
var1 <- c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0)
var2 <- c(
b_var2_dum1 = 0,
b_var2_dum2 = 0)
var3 <- c(
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
And then be able to use those in the specification:
specification1 = c(
var1,
var2,
var3)
And then:
apollo_beta = specification1
I know you might not have the best knowledge of the very niche programme Apollo. I am not quite sure if this is even possible, but since it would save me days (maybe weeks) of work, can anyone give me a hint on what I might be doing wrong? I worry I have a list within a list.
Since I have to make 60 specifications of the same model with different variations of 6 variables, it would be a lot of code and a lot of work if I can't shorten it like this.
Any tips would be greatly appreciated.
Data:
df <- data.frame(
var1_dum1 = c(0, 1, 0),
var1_dum2 = c(1, 0, 0),
var1_dum3 = c(0, 0, 1),
var2_dum1 = c(0, 1, 0),
var2_dum2 = c(1, 0, 0),
var3_dum1 = c(1, 1, 0),
var3_dum2 = c(1, 0, 0),
var3_dum3 = c(0, 1, 0),
var3_dum4 = c(0, 0, 1)
)
So there is a dataset with these variables. In Apollo you specify "database = df" first, so it already refers to the variables.
The apollo_beta list doesn't refer to the variables directly, so technically you can name its elements whatever you want. I just want to name them after the variables because I will refer to them later.
My question is simple: can I condense the long list to simply say "specification1"? It's really a question about the R language - whether the items of the list behave the same way as when they are written out in full.
In other words, would calling apollo_beta in the above three examples lead to the same result? If not, how do I change the code so that it does?
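For the plain R part of the question (setting Apollo itself aside), a quick check shows that c() flattens named numeric vectors and keeps their names, so there is no list within a list:
# Minimal check in base R; Apollo is not involved here
var1 <- c(b_var1_dum1 = 0, b_var1_dum2 = 0, b_var1_dum3 = 0)
var2 <- c(b_var2_dum1 = 0, b_var2_dum2 = 0)
var3 <- c(b_var3_dum1 = 0, b_var3_dum2 = 0, b_var3_dum3 = 0, b_var3_dum4 = 0)
specification1 <- c(var1, var2, var3)    # a flat named numeric vector
identical(specification1,
          c(b_var1_dum1 = 0, b_var1_dum2 = 0, b_var1_dum3 = 0,
            b_var2_dum1 = 0, b_var2_dum2 = 0,
            b_var3_dum1 = 0, b_var3_dum2 = 0,
            b_var3_dum3 = 0, b_var3_dum4 = 0))
# [1] TRUE
So as far as base R is concerned, apollo_beta = specification1 should behave the same as writing the long vector out by hand; whether Apollo inspects the object in some other way is beyond this check.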

Count number of columns that are not zero in a data frame [duplicate]

This question already has answers here:
Error: `n()` must only be used inside dplyr verbs
(3 answers)
Closed 1 year ago.
I have made the following script to sum rows of a data frame and count the number of columns that are not zero for all rows. Suddenly my script stopped working and I am not sure what the error is.
test <- structure(list(col1 = c(0.126331200264469, 0, 0, 0, 0), col2 = c(0,
0, 0, 0, 0), col3 = c(0, 0, 0, 0, 0), col4 = c(0, 0, 0, 0, 0),
col5 = c(0, 0, 0, 0, 0)), row.names = c("row1", "row2", "row3",
"row4", "row5"), class = "data.frame")
script:
test.out <- test %>%
mutate(Not_Present = across(everything(), ~ . == 0) %>%
reduce(`+`), Present = ncol(test)- Not_Present)
error:
Error: `across()` must only be used inside dplyr verbs.
Run `rlang::last_error()` to see where the error occurred.
Another option is using rowSums
library(dplyr)
test %>%
mutate(Not_Present = rowSums(across(everything()) == 0),
Present = ncol(test) - Not_Present)
If it helps in any way for further work, I would just go with:
test.out <- sum(apply(test != 0, 2, any))
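For reference, a hedged base-R equivalent of the rowSums/across answer above, without loading any package:
not_present <- rowSums(test == 0)             # zero entries per row
present     <- ncol(test) - not_present       # non-zero columns per row
cbind(test, Not_Present = not_present, Present = present)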

Is there a way to count occurrences of a specific value for unique columns in a dataframe in R?

I am relatively new to R and have a dataframe (cn_data2) with several duplicated columns. It looks something like this:
Gene breast_cancer breast_cancer breast_cancer lung_cancer lung_cancer
myc 1 0 1 1 2
ARID1A 0 2 1 1 0
Essentially, the rows are genes and the columns are different types of cancers. What I want is to find, for each gene, the number of times a value (0, 1, or 2) occurs for each unique cancer type.
I have tried several things but haven't been able to achieve what I want. For example, cn_data2$count1 <- rowSums(cn_data2 == '1') gives me a column with the number of "1"s for each gene, but what I want is the number of "1"s for each individual disease.
Hope my question is clear! I appreciate any help, thank you!
structure(list(gene1 = structure(1:6, .Label = c("ACAP3", "ACTRT2",
"AGRN", "ANKRD65", "ATAD3A", "ATAD3B"), class = "factor"), glioblastoma_multiforme_Primary_Tumor = c(0,
0, 0, 0, 0, 0), glioblastoma_multiforme_Primary_Tumor.1 = c(-1,
-1, -1, -1, -1, -1), glioblastoma_multiforme_Primary_Tumor.2 = c(0,
0, 0, 0, 0, 0), glioblastoma_multiforme_Primary_Tumor.3 = c(2,
2, 2, 2, 2, 2), glioblastoma_multiforme_Primary_Tumor.4 = c(0,
0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 6L))
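In case it helps, a hedged sketch (count_value is a hypothetical helper, and cn_data2 is assumed to be the data frame from the dput above): strip the ".1", ".2", ... suffixes that R appends to duplicated column names, then count the chosen value per gene within each unique cancer type:
count_value <- function(df, value) {
  mat  <- as.matrix(df[-1])                    # drop the gene column
  type <- sub("\\.\\d+$", "", colnames(mat))   # collapse duplicated names
  counts <- sapply(unique(type), function(tt)
    rowSums(mat[, type == tt, drop = FALSE] == value))
  data.frame(gene = df[[1]], counts, check.names = FALSE)
}
count_value(cn_data2, 1)   # counts of the value 1 per gene and cancer type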

Can I convert a certain string upon reading with fread in R data.table?

I am trying to use R's data.table package to read a large data set (~800k rows). The set contains results from a simulation of 1000 scenarios (plus a scenario "0", so 1,001 scenarios in total), and one of the columns, "ScenId", contains the number of the scenario, e.g. 0, 1, 2, ...
The problem is that the program used to output this txt file cannot name scenario 1000 as '1000' and uses 'AAA' instead. The column 'ScenId' thus contains only numbers, apart from the value 'AAA'.
I am trying to find a solution to convert 'AAA' to 1000, preferably within the fread command.
My current workaround is using na.strings = "AAA" in fread and then replacing the NA's with 1000, after reading is complete. This works well because those are the only NA instances in the data set.
However, I was hoping for a quicker / more elegant solution, i.e. to do this within the fread command.
Any help / advice will be much appreciated.
Later edit: an attempt at posting sample data.
structure(list(ScenId = "AAA", SensId = "_", SystemProd = "ZCPP__",
AssumClass = "SPLPSV", ProjPer = 40L, ProjMode = "Annual",
VarName = "belLUL", Description = "(BEL)",
Module = "MLIAB", FormType = "inv", Group = "calc.BEL", Width = 12L,
Decimals = 2L, Scale = "Yes", Value000 = 0, Value001 = 0,
Value002 = 0, Value003 = 0, Value004 = 0, Value005 = 0, Value006 = 0,
Value007 = 0, Value008 = 0, Value009 = 0, Value010 = 0, Value011 = 0,
Value012 = 0, Value013 = 0, Value014 = 0, Value015 = 0, Value016 = 0,
Value017 = 0, Value018 = 0, Value019 = 0, Value020 = 0, Value021 = 0,
Value022 = 0, Value023 = 0, Value024 = 0, Value025 = 0, Value026 = 0,
Value027 = 0, Value028 = 0, Value029 = 0, Value030 = 0, Value031 = 0,
Value032 = 0, Value033 = 0, Value034 = 0, Value035 = 0, Value036 = 0,
Value037 = 0, Value038 = 0, Value039 = 0, Value040 = 0), .Names = c("ScenId",
"SensId", "SystemProd", "AssumClass", "ProjPer", "ProjMode",
"VarName", "Description", "Module", "FormType", "Group", "Width",
"Decimals", "Scale", "Value000", "Value001", "Value002", "Value003",
"Value004", "Value005", "Value006", "Value007", "Value008", "Value009",
"Value010", "Value011", "Value012", "Value013", "Value014", "Value015",
"Value016", "Value017", "Value018", "Value019", "Value020", "Value021",
"Value022", "Value023", "Value024", "Value025", "Value026", "Value027",
"Value028", "Value029", "Value030", "Value031", "Value032", "Value033",
"Value034", "Value035", "Value036", "Value037", "Value038", "Value039",
"Value040"), class = c("data.table", "data.frame"), row.names = c(NA,
-1L), .internal.selfref = <pointer: 0x0000000000310788>)
This is just one line of my data set. Hope this makes sense.
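One hedged alternative to the na.strings workaround (the file name results.txt is just a placeholder): force ScenId to character with colClasses, then recode the single value after the read:
library(data.table)
dt <- fread("results.txt", colClasses = list(character = "ScenId"))
dt[ScenId == "AAA", ScenId := "1000"]          # recode the special scenario
dt[, ScenId := as.integer(ScenId)]             # back to an integer column
This is still a post-read step rather than a conversion inside fread itself, but it avoids creating NAs along the way.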

Tried streamlining w/ SDCols - got "longer object length is not a multiple of shorter object length"

I have tried searching stackoverflow and google to get answers to my question, but I couldn't find anything that applied closely enough for me to be able to apply it. However, I'm very new to R, so it's likely that I may just need a little walking through it.
If I use the following code, it works just fine.
> dput(b)
structure(list(DUMP_END_SHIFT_DATE = structure(c(1420070400,
1420070400, 1420156800, 1420156800, 1420243200, 1420243200, 1420329600,
1420329600, 1420416000, 1420416000, 1420502400), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), QUANTITY_REPORTING = c(235, 219, 232,
219, 219, 219, 219, 219, 219, 219, 235), WTRECV = c(32.71, 32.71,
20.19, 33.42, 21.61, 21.61, 21.61, 20.19, 21.61, 20.19, 24.2),
LC12 = c(0, 0, 0, 94, 100, 100, 100, 0, 100, 0, 100), LC34 = c(0,
100, 0, 6, 0, 0, 0, 0, 0, 0, 0), LC5 = c(0, 0, 5, 0, 0, 0,
0, 5, 0, 5, 0), HIS = c(25, 0, 60, 0, 0, 0, 0, 60, 0, 60,
0), UC = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), IBC = c(75,
0, 35, 0, 0, 0, 0, 35, 0, 35, 0)), .Names = c("DUMP_END_SHIFT_DATE",
"QUANTITY_REPORTING", "WTRECV", "LC12", "LC34", "LC5", "HIS",
"UC", "IBC"), class = c("data.table", "data.frame"), row.names = c(NA,
-11L), .internal.selfref = <pointer: 0x0000000005860788>)
library(data.table)
b_daily <- b[,.(d_tons=sum(QUANTITY_REPORTING)),by=DUMP_END_SHIFT_DATE]
b_daily[,"d_WTRECV" := b[,.(d_WTRECV=sum(QUANTITY_REPORTING*WTRECV)),by=DUMP_END_SHIFT_DATE] [,.(round(d_WTRECV/d_tons, digits=2))]]
b_daily[,"d_LC12" := b[,.(d_LC12=sum(QUANTITY_REPORTING*LC12)),by=DUMP_END_SHIFT_DATE] [,.(round(d_LC12/d_tons, digits=2))]]
b_daily[,"d_LC34" := b[,.(d_LC34=sum(QUANTITY_REPORTING*LC34)),by=DUMP_END_SHIFT_DATE] [,.(round(d_LC34/d_tons, digits=2))]]
b_daily[,"d_LC5" := b[,.(d_LC5=sum(QUANTITY_REPORTING*LC5)),by=DUMP_END_SHIFT_DATE] [,.(round(d_LC5/d_tons, digits=2))]]
b_daily[,"d_HIS" := b[,.(d_HIS=sum(QUANTITY_REPORTING*HIS)),by=DUMP_END_SHIFT_DATE] [,.(round(d_HIS/d_tons, digits=2))]]
b_daily[,"d_UC" := b[,.(d_UC=sum(QUANTITY_REPORTING*UC)),by=DUMP_END_SHIFT_DATE] [,.(round(d_UC/d_tons, digits=2))]]
b_daily[,"d_IBC" := b[,.(d_IBC=sum(QUANTITY_REPORTING*IBC)),by=DUMP_END_SHIFT_DATE] [,.(round(d_IBC/d_tons, digits=2))]]
However, it seems very inelegant - I think that I should be able to do this using .SD and .SDcols. I tried the following, just as a test case:
b_daily2 <- b[,lapply(.SD, function (x) sum(x*b[,QUANTITY_REPORTING])/sum(b[,QUANTITY_REPORTING])), by=DUMP_END_SHIFT_DATE, .SDcols=c("WTRECV")] [,.(DUMP_END_SHIFT_DATE,d_WTRECV=round(WTRECV, digits=2))]
The resulting numbers are a little off, and I get the following warning:
"In x * MQD[, QUANTITY_REPORTING] : longer object length is not a multiple of shorter object length"
I understand that this indicates recycling due to objects being different lengths... but I don't understand why it happens here or what exactly is being recycled. Any help would be much appreciated. I apologize in advance if this is an elementary question. Thank you.
This is arguably also inelegant, but at least fits into a single operation:
b_daily <- b[,{
d_tons = sum(QUANTITY_REPORTING)
d_WTRECV = round( sum(QUANTITY_REPORTING*WTRECV)/d_tons, digits = 2 )
list(d_tons = d_tons, d_WTRECV = d_WTRECV)
},by=DUMP_END_SHIFT_DATE]
If there are many columns like d_WTRECV, with names stored in cols = c("WTRECV",...), then...
cols <- c("WTRECV","LC12","LC34","LC5","HIS","UC","IBC")
b_daily2 <- b[,{
d_tons = sum(QUANTITY_REPORTING)
res = lapply(mget(cols), function(x)
round( sum(QUANTITY_REPORTING*x)/d_tons, digits = 2 )
)
c(list(d_tons = d_tons), setNames(res, paste0("d_",cols)))
},by=DUMP_END_SHIFT_DATE]
A similar approach using .SDcols will be possible when a bug related to it is fixed.
Aside. I think there is a feature request to allow for the first column to be used in computing the second, like
# NON-WORKING CODE:
b_daily <- b[,.(
d_tons = sum(QUANTITY_REPORTING),
d_WTRECV = round( sum(QUANTITY_REPORTING*WTRECV) / d_tons, digits = 2)
),by=DUMP_END_SHIFT_DATE]
This is how mutate in the dplyr package works. However, for your multicolumn case, dplyr is more of a hassle than a help, as far as I can figure.
By the way, you may want to wait on rounding. Usually, it's only a good idea for printing purposes and just unnecessarily worsens your later calculations.
I don't think there is a particularly elegant way to do this. Here's a quick take.
sdc <- c("WTRECV", "LC12", "LC34", "LC5", "HIS", "UC", "IBC")
b2 <- copy(b)
b2[, (sdc) := lapply(.SD, "*", b2[, QUANTITY_REPORTING]), .SDcols=sdc]
b_daily <- b2[, lapply(.SD, sum), by=DUMP_END_SHIFT_DATE]
data.table(
b_daily[, .(DUMP_END_SHIFT_DATE)],
b_daily[, lapply(lapply(.SD, "/", b_daily[,QUANTITY_REPORTING]), round, 2), .SDcols=sdc]
)
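For completeness, a hedged single-call variant using .SDcols (reusing sdc from above; this assumes a data.table version in which referencing QUANTITY_REPORTING alongside .SDcols works, as it does in recent releases):
b_daily3 <- b[, c(list(d_tons = sum(QUANTITY_REPORTING)),
                  setNames(lapply(.SD, function(x)
                    round(sum(QUANTITY_REPORTING * x) / sum(QUANTITY_REPORTING), 2)),
                    paste0("d_", sdc))),
              by = DUMP_END_SHIFT_DATE, .SDcols = sdc]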
