I have a table with many columns (fields). In the first field, I need to retain only unique values. For each of the subsequent columns, I need to count how many of the rows sharing a given first-column value have a value > 0 in that column.
I've managed to partially accomplish this with awk, but my current attempt would require me to manually create an array for every column in the table and manually type each array for the print command. This isn't really feasible.
Any help/suggestions (and explanation of how a potential solution works) would be greatly appreciated.
Here's a subset of the INPUT TABLE (it has already been sorted on column 1):
ATP6 93.883156 55.84006
COX1 230.708456 63.109
COX2 179.993226 74.224269
COX3 169.945901 72.036519
CYTB 228.799722 87.575892
LOC111099029 0.926958 6.124982
LOC111099030 10.124096 5.024844
LOC111099031 0 0
LOC111099031 0 0
LOC111099031 2.279801 2.289838
LOC111099032 17.674714 12.796428
LOC111099033 5.259716 7.326938
LOC111099034 3.514635 2.858349
LOC111099035 0 0
LOC111099035 1.929607 4.409107
LOC111099036 0 0
LOC111099036 1.45196 7.58513
LOC111099037 21.520663 26.353308
LOC111099038 6.019084 5.311657
LOC111099039 12.858404 13.689644
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0 0
LOC111099040 0.354202 0.265986
LOC111099040 0.587969 0
LOC111099040 2.620288 1.077892
LOC111099040 4.290659 3.487692
LOC111099040 6.42671 6.906503
LOC111099041 0 0
LOC111099041 3.892818 4.934959
LOC111099042 0 0
LOC111099042 13.86859 14.319505
LOC111099043 0 0
Here's an example of the DESIRED OUTPUT:
LOC111099030 1 1
CYTB 1 1
LOC111099042 1 1
LOC111099037 1 1
LOC111099033 1 1
COX3 1 1
ATP6 1 1
LOC111099039 1 1
LOC111099036 1 1
LOC111099040 5 4
LOC111099035 1 1
LOC111099032 1 1
COX2 1 1
LOC111099038 1 1
LOC111099031 1 1
COX1 1 1
LOC111099029 1 1
LOC111099041 1 1
LOC111099034 1 1
Here's the code I've run to obtain the output above:
awk '{if ($2 > 0) gene_name[$1]++}; {if ($3 > 0) col3_arr[$1]++}; END{ for (var in gene_name) print var, "\t", gene_name[var], col3_arr[var]}' input_file.txt
P.S. I'm also open to a solution in R, as this manipulation is part of a larger R Markdown notebook. I went the awk route because I'm not particularly well-versed with R.
In R, with dplyr:
library(dplyr)
desired_result = your_data %>%
  group_by(name_of_first_column) %>%
  summarize(across(everything(), ~ sum(. > 0)))
In base R, we may use rowsum, which sums matrix rows within groups defined by a grouping vector:
rowsum(+(df1[-1] > 0), df1[[1]])
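A brief usage sketch with the sample data above (the object and column names here are made up; the input file has no header row):
# read the whitespace-delimited table shown above
df1 <- read.table("input_file.txt", header = FALSE,
                  col.names = c("gene", "sample1", "sample2"))

# +(df1[-1] > 0) turns the value columns into a 0/1 indicator matrix;
# rowsum() then adds the indicators up per unique value in the first column
counts <- rowsum(+(df1[-1] > 0), df1[[1]])

# note: unlike the awk attempt, genes whose values are all zero still appear,
# with counts of 0
head(counts)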
I have a cudf dataframe
type(pred)
> cudf.core.dataframe.DataFrame
print(pred)
> action
1778378 0
1778379 1
1778381 1
1778383 0
1778384 0
... ...
2390444 0
2390446 0
2390478 0
2390481 0
2390489 1
that I would like to convert to a pandas.DataFrame(). However, this doesn't work as expected:
pd.DataFrame(pred)
> 0
0 action
And I just found the answer:
pred.to_pandas()
I have generated a heatmap with pheatmap and, for some reasons, I want the rows to appear in a predefined order.
I saw in previous posts that the solution is to set the parameter cluster_rows to FALSE and to order the matrix in the order we want, like this in my case:
Otu0085 Otu0086 Otu0087 Otu0088 Otu0091
AB200 0 0 0 0 0
2 91 0 2 1 0
20CF360 0 1 0 1 0
19CF359 0 0 0 2 0
11VP12 0 0 0 0 155
11VP04 4 1 0 0 345
However, when I do:
pheatmap(shared,cluster_rows = F)
My rows are sorted alphabetically, like this:
10CF278a
11
11AA07
11CF278b
11VP03
11VP04
11VP05
11VP06
11VP08
11VP09
Any suggestions would be welcome.
Thanks in advance.
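A minimal sketch of the reordering step described above, assuming shared is a matrix and row_order is a hypothetical character vector holding its row names in the desired display order:
library(pheatmap)

# reorder the matrix rows explicitly before plotting
shared_ordered <- shared[row_order, , drop = FALSE]

# with cluster_rows = FALSE, pheatmap draws the rows in the matrix's own order,
# so they follow row_order rather than the alphabetical order of the input
pheatmap(shared_ordered, cluster_rows = FALSE)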
I am trying to write a function that "variabilizes" the ddply call:
december <- ddply(adk47, .(PeakName, Elevation), summarize,
needThese=if(sum(dec) == 0) "needThis"
else character(0), .progress='text')
There are three-letter column names for each month in the df. I am trying to write the function as:
need.fr.month <- function(df, monthCol) {
needThese <- ddply(df, .(PeakName, Elevation),
summarize,
needThese=if(sum(monthCol) == 0)
"needThis" else character(0)
)
return(needThese)
}
but when I call this with
need.fr.month(adk47, oct)
or with
need.fr.month(adk47, "oct")
I get these error messages:
Error in eval(expr, envir, enclos) : object 'monthCol' not found
or
Error in sum("monthCol") : invalid 'type' (character) of argument
I know that I am not getting something very basic, but I don't know what.
I am using this DF to practice writing R functions. My other functions have gone fairly well; however, this is the first function in which I am trying to variabilize a df column.
Help would be greatly appreciated.
Here is a reproducible example for a subset of the data:
PeakName Elevation jul aug sep oct nov dec
Algonquin 5114 0 0 1 0 0 0
Algonquin 5114 0 0 0 0 0 0
Algonquin 5114 0 0 0 1 0 0
Algonquin 5114 1 0 0 0 0 0
Allen 4340 0 0 0 0 0 0
Allen 4340 0 0 0 0 0 0
Allen 4340 0 0 1 0 0 0
Allen 4340 1 0 0 0 0 0
Allen 4340 0 0 0 0 1 0
Armstrong 4400 0 0 0 0 0 0
Armstrong 4400 0 0 0 0 0 0
Armstrong 4400 0 0 0 0 0 0
Armstrong 4400 0 0 0 0 0 0
Armstrong 4400 0 0 0 0 1 0
Armstrong 4400 0 0 0 0 0 0
Armstrong 4400 0 0 0 1 0 0
Basin 4827 1 0 0 0 0 0
Basin 4827 0 0 0 0 0 0
Basin 4827 0 0 0 0 0 0
Basin 4827 0 0 0 0 0 0
Basin 4827 0 0 0 0 0 0
Basin 4827 0 0 0 0 0 0
Basin 4827 0 0 0 0 1 0
Big.Slide 4240 0 0 0 0 0 0
Big.Slide 4240 0 0 0 1 0 0
Big.Slide 4240 0 0 0 0 0 0
Big.Slide 4240 0 0 1 0 0 0
Big.Slide 4240 0 0 0 0 0 0
Big.Slide 4240 0 0 0 0 0 0
Big.Slide 4240 0 0 0 0 0 0
Big.Slide 4240 1 0 0 0 0 0
I hope this suffices. Clearly this is a subset of the data. The form is that each "hike" has one line, with the month columns (here truncated to July through December) showing a "1" for that hike's month and a zero for the other 11.
Thanks
Wayne
When you call
need.fr.month(adk47, oct)
R looks for a variable named oct in your general environment, and finds nothing. Therefore it reports that it is not found.
If you call:
need.fr.month(adk47, "oct")
R attempts to use the string "oct" in place of monthCol. But taking the sum of a character string doesn't make sense, so it throws an error.
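Both failures can be reproduced in isolation (a minimal sketch, wrapped in try() so the errors don't stop a script):
# a bare name: R searches the calling environment for an object called `oct`
# and finds none, hence "object 'oct' not found"
try(oct)

# a string: sum() receives the character "oct" itself rather than the column's
# values, hence "invalid 'type' (character) of argument"
try(sum("oct"))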
Passing arguments into inner functions can be difficult. A quick kludge is the infamous eval-parse construct. While it gets the job done, it's generally not recommended, because there are often simpler ways to do the same thing.
need.fr.month <- function(df, monthCol) {
  # build the ddply call as a string (single quotes around 'needThis' so they
  # don't terminate the string), then parse and evaluate it
  needThese <- eval(parse(text = paste0(
    "ddply(df, .(PeakName, Elevation), summarize,
       needThese = if (sum(", monthCol, ") == 0) 'needThis' else character(0))"
  )))
  return(needThese)
}
Here, you don't need to eval-parse to get what you want. Just don't use summarize and rely on the base R extraction functions:
need.fr.month <- function(df, monthCol) {
  # monthCol is a column name given as a string; [[ ]] extracts that column
  # from each PeakName/Elevation group, and its sum becomes column V1
  needThese <- ddply(df, .(PeakName, Elevation),
                     function(x) sum(x[[monthCol]]))
  return(needThese)
  #return(needThese[needThese[["V1"]] != 0,])
}
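For example, with the sample data read into adk47 (a sketch; ddply names the unnamed per-group result V1 by default):
library(plyr)

# per-group totals for the requested month
dec_sums <- need.fr.month(adk47, "dec")

# groups whose December total is zero, i.e. the ones that still "need" a hike
dec_sums[dec_sums$V1 == 0, ]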
I think this approach could be made better, but I can't improve on it further without knowing what you want to do with the information. If you want to find the rows that you'd like to subset, I think it would be better to do something like:
need.fr.month <- function(df, monthCol) {
ave(df[[monthCol]],df[["PeakName"]],df[["Elevation"]],FUN=sum)
}
adk47$need <- need.fr.month(adk47,"dec") == 0
This then gives you a column in your data frame that will let you subset for the data you are looking for, via adk47$need == TRUE.
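For instance (a short sketch continuing from the column created above):
# unique peak/elevation combinations that still need a December hike
unique(adk47[adk47$need, c("PeakName", "Elevation")])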
It seems that summarize cannot find objects from the environment that calls ddply. However, you can manually attach this environment to the search path. After the ddply call, you can detach the environment.
Here is a quick example; a similar approach should work for you as well.
test_fun <- function(team_vec) {
  # attach this function's environment so summarise can find team_vec
  attach(environment())
  tmp <- ddply(baseball,
               "team",
               summarise,
               duration = (if (unique(team) %in% team_vec) max(year) - min(year) else 0))
  detach(environment())
  tmp
}
test_fun(c("PIT","PHI"))
Thanks all, both of these are very useful.
I went with a modified version of Blue Magister's 2nd example:
need.fr.month <- function(df, monthCol) {
needThese <- ddply(df, .(PeakName, Elevation),
function(x) sum(x[[monthCol]]))
subsetNeedThese <- subset(needThese, V1 == 0, select=c(PeakName, Elevation))
return(subsetNeedThese)
}
as it returns exactly what I need and I understand what it is doing. I haven't dealt with attaching and detaching environments before, so I thank croy111 for the example. I will need to read up on this! Likewise, Blue Magister's eval-parse does seem like an easy way for me to do something I really don't understand properly.
I appreciated Blue Magister's comment: "Passing arguments into inner functions can be difficult". I will accept, for now, that this problem goes away if you avoid calling an inner function (such as "summarize") and think about it again next time I run into a problem like this!!
I think it would be far easier to create a single column that records which month each row's indicator variables flag (as described in Optimization: splitting dataframe into a list of dataframes, transforming data per row) and then subset from that.
I would advocate using data.table rather than ddply + summarize for efficiency (but perhaps this is a longer-term goal).
Using data.table to get access to set (which will also work on data.frames):
library(data.table)
adk47$monthCol <- character(nrow(adk47))
# data.table specific
# adk47 <- data.table(adk47)
# adk47[, monthCol := character(nrow(adk47))]
# find which columns are == 1
whiches <- lapply(adk47[c("jul", "aug", "sep", "oct", "nov", "dec")],
function(x) which(x==1))
# the data.table approach would require
# adk47[, c("jul", "aug", "sep", "oct", "nov", "dec"), with = FALSE]
for(val in names(whiches)){
set(adk47, i = whiches[[val]], j = 'monthCol', value = val)
}
head(adk47)
PeakName Elevation jul aug sep oct nov dec monthCol
1 Algonquin 5114 0 0 1 0 0 0 sep
2 Algonquin 5114 0 0 0 0 0 0
3 Algonquin 5114 0 0 0 1 0 0 oct
4 Algonquin 5114 1 0 0 0 0 0 jul
5 Allen 4340 0 0 0 0 0 0
6 Allen 4340 0 0 0 0 0 0
You can then subset using monthCol.
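For example (a short sketch; finding the peaks without a December hike is taken from the question above):
# hikes recorded in December
dec_hikes <- adk47[adk47$monthCol == "dec", ]

# peak/elevation combinations with no December hike at all
unique(adk47[!(adk47$PeakName %in% dec_hikes$PeakName), c("PeakName", "Elevation")])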
I am currently trying to set up a heatmap of a matrix and highlight specific cells, based on two other matrices.
An example:
> SOI
NAP_G021 NAP_G033 NAP_G039 NAP_G120 NAP_G122
2315101 59.69418 27.26002 69.94698 35.22521 38.63995
2315102 104.15294 76.70379 114.72999 97.35930 79.46014
2315104 164.32822 61.83898 140.99388 63.25482 105.48041
2315105 32.15792 21.03730 26.89965 36.25943 40.46321
2315103 74.67434 82.49875 133.89709 93.17211 35.53019
> above150
NAP_G021 NAP_G033 NAP_G039 NAP_G120 NAP_G122
2315101 0 0 0 0 0
2315102 0 0 0 0 0
2315104 1 0 0 0 0
2315105 0 0 0 0 0
2315103 0 0 0 0 0
> below30
NAP_G021 NAP_G033 NAP_G039 NAP_G120 NAP_G122
2315101 0 1 0 0 0
2315102 0 0 0 0 0
2315104 0 0 0 0 0
2315105 0 1 1 0 0
2315103 0 0 0 0 0
Now I create a normal heatmap:
heatmap(t(SOI), Rowv = NA, Colv = NA)
Now what I want to do is highlight the cells that have a 1 in above150 with a frame of one colour (e.g. blue), while the cells with a 1 in below30 should get a red frame. Of course all matrices are equally sized, as they are related. I know that I can add things to the heatmap after processing via add.expr, but so far I have only managed to create full ablines that span the whole heatmap, which is not what I'm looking for.
If anybody has any suggestions I would be delighted.
When add.expr is evaluated, the plot is set up so that the centres of the cells are at integer coordinates. Try add.expr = points(1:5, 1:5) to see. Now all you need to do is write a function that draws boxes (help(rect)) with the colours you need, at the half-integer coordinates.
Try this:
set.seed(310366)

# build example data with the same structure as SOI
nx = 5
ny = 6
SOI = matrix(rnorm(nx * ny, 100, 50), nx, ny)
colnames(SOI) = paste("NAP_G0", sort(as.integer(runif(ny, 10, 99))), sep = "")
rownames(SOI) = sample(2315101:(2315101 + nx - 1))

# logical matrices marking the cells to frame
above150 = SOI > 150
below30 = SOI < 30

# draw a rectangle around every TRUE cell of tfMat; cell centres sit at
# integer coordinates, so the corners are at +/- 0.49 around them
makeRects <- function(tfMat, border){
  cAbove = expand.grid(1:nx, 1:ny)[tfMat,]
  xl = cAbove[,1] - 0.49
  yb = cAbove[,2] - 0.49
  xr = cAbove[,1] + 0.49
  yt = cAbove[,2] + 0.49
  rect(xl, yb, xr, yt, border = border, lwd = 3)
}

heatmap(t(SOI), Rowv = NA, Colv = NA, add.expr = {
  makeRects(above150, "red"); makeRects(below30, "blue")})
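To apply this to the actual matrices from the question, a sketch (note that makeRects uses the nx and ny objects defined outside the function, and that the question's above150/below30 are 0/1 matrices rather than logical, so they are compared to 1 here; the colours follow the question's blue/red request):
nx = nrow(SOI)
ny = ncol(SOI)

heatmap(t(SOI), Rowv = NA, Colv = NA, add.expr = {
  makeRects(above150 == 1, "blue"); makeRects(below30 == 1, "red")})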