Determine Row Number based on Nonzero Elements - r

I am currently working with about 301 rows of data and want to determine the earliest row at which only a few particular columns are nonzero. However, I also want to ensure that this does not change afterwards. For example, if two columns are nonzero while all other columns are zero, but other columns become nonzero again later in the data frame, then I would have to pick a later row as the "correct" one.
I have the data:
1 x y z xx xy xz
292 0 -8.965140 9.596890 0 0 0 -0.03147483
293 0 -9.079889 9.645991 0 0 0 -0.02722520
294 0 -8.967767 9.597826 0 0 0 0
295 0 -9.090561 9.650230 0 0 0 -0.02685287
296 0 -9.081568 9.646105 0 0 0 -0.02716237
297 0 0.000000 0.000000 0 0 0 0.00000000
298 0 0.000000 0.000000 0 0 0 0.00000000
299 0 -9.098568 9.628576 0 0 0 -0.02654466
300 0 -9.089815 9.646099 0 0 0 -0.02681748
301 0 -8.998078 9.605140 0 0 0 0
As you can see, only x and y are nonzero in row 294; however, the xz variable takes nonzero values again after that, up until row 301. Is it possible to write a function that tells me the minimum row from which only x and y are nonzero and it remains that way until the final row of the data frame?
I'm sorry if the question is difficult to understand; I found it hard to describe exactly what I am trying to accomplish.
EDIT: I presume I could use something like
which(df$x != 0 & df$y != 0 &
(df[, 1] | df[, 4] == 0))
but then I need to somehow expand the second or statement to all columns of df.
Thanks in advance.
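One way to approach this can be sketched in base R. The function name first_stable_row and the small data frame below are made up for illustration (the values are abridged from the sample above), and the sketch assumes that a "good" row means every column in keep is nonzero and every other column is zero. It finds the last row that breaks this rule and returns the row name right after it.

```r
# Hypothetical helper: first row from which only the `keep` columns are
# nonzero, and this stays true through the last row of the data frame.
first_stable_row <- function(df, keep = c("x", "y")) {
  others <- setdiff(names(df), keep)
  # TRUE where all `keep` columns are nonzero and all other columns are zero
  ok <- rowSums(df[, keep, drop = FALSE] != 0) == length(keep) &
    rowSums(df[, others, drop = FALSE] != 0) == 0
  bad <- which(!ok)
  if (length(bad) == 0) return(rownames(df)[1])  # every row qualifies
  last_bad <- max(bad)
  if (last_bad == nrow(df)) return(NA_character_)  # final row breaks the rule
  rownames(df)[last_bad + 1]
}

# Made-up data, loosely following rows 297 to 301 of the sample
df <- data.frame(
  x  = c(-8.97, -9.09, 0, -9.10, -9.00),
  y  = c( 9.60,  9.65, 0,  9.63,  9.61),
  z  = 0,
  xz = c(0, -0.027, 0, -0.027, 0),
  row.names = 297:301
)

result <- first_stable_row(df)
print(result)
```

Note that under this assumption the all-zero rows (such as 297 and 298 in the sample) count as breaking the pattern; if they should be allowed, drop the first rowSums() condition.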

Related

How to import and transform adjacency matrix to R edge list?

A sample of my data can be seen below. The data contains information about ties between organizations (over 2000 organizations, the csv file has 0s and 1s, and empty cells)
A2654 B0004 B0188 B1278 B1372 B1722 B2503
A2654 0 1 0 0 0 1 0
B0004 1 0 0 0 0 1 0
B0188 0 0 0 0 0 0 0
B1278 0 0 0 0 0 0 0
B1372 0 0 0 0 0 0 0
B1722 1 1 0 0 0 0 0
(1) The first problem is that I can't import this data (.csv) into R
I run the following code: dt <- read_csv2("Org_ties.csv"). The problem here is that while the first column header is left empty in the csv file (as it should be), when the file is read into R, read_csv2() generates the label "X1" for this column. I do this in order to run the next line, g = graph_from_adjacency_matrix(dtmtrx, mode = "directed", weighted = T), to produce a graph. However, I get the error message below. I think it has to do with the fact that I can't read the file properly.
graph.adjacency.dense(adjmatrix, mode = mode, weighted = weighted, :
not a square matrix
In addition: Warning message:
In mde(x) : NAs introduced by coercion
(2) Another puzzling thing is that I cannot seem to transform the current data structure into an edge list. How can I do that? The edge list should look something like this:
V1 V2 weight
A2654 B0004 1
A2654 B0188 0
A2654 B1278 0
A2654 B1372 0
A2654 B1722 1
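A minimal base R sketch of both steps, under some assumptions: the file is comma-separated (read_csv2() expects semicolons as the separator), the first column holds the row labels, and empty cells mean "no tie". The inline csv_text is a made-up three-organization stand-in for Org_ties.csv. Note also that the sample shown above has seven columns but only six rows, and a non-square matrix would by itself trigger the "not a square matrix" error.

```r
# Made-up stand-in for the real file; one cell is left empty on purpose
csv_text <- ",A2654,B0004,B0188
A2654,0,1,0
B0004,1,0,
B0188,0,0,0"

# row.names = 1 turns the unlabeled first column into row names
adj <- as.matrix(read.csv(text = csv_text, row.names = 1))
adj[is.na(adj)] <- 0            # assumption: empty cells mean no tie

# Long format: one row per (from, to) pair
edges <- as.data.frame(as.table(adj), stringsAsFactors = FALSE)
names(edges) <- c("V1", "V2", "weight")
edges <- edges[order(edges$V1, edges$V2), ]
rownames(edges) <- NULL
print(edges)
```

With the real file, read.csv("Org_ties.csv", row.names = 1) would replace the text = argument, and the resulting square matrix adj is the kind of object graph_from_adjacency_matrix() expects.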

Apply a function with if inside to a dataframe to take a value in a list in R

Hello everybody, and thank you in advance for any help.
I imported a txt file named "project" into R. The resulting data frame is called "data" and consists of 12 columns with information on 999 households.
head(data)
im iw r am af a1c a2c a3c a4c a5c a6c a7c
1 0.00 20064.970 5984.282 0 38 0 0 0 0 0 0 0
2 15395.61 7397.191 0.000 42 30 1 0 0 0 0 0 0
3 16536.74 18380.770 0.000 33 28 1 0 0 0 0 0 0
4 20251.87 14042.250 0.000 38 38 1 1 0 0 0 0 0
5 17967.04 12693.240 0.000 24 39 1 0 0 0 0 0 0
6 12686.43 21170.450 0.000 62 42 0 0 0 0 0 0 0
im=male income
iw=female income
r=rent
am=male age
af=female age
a1c, a2c, ..., a7c take the value 1 when there is a child under the age of 18
and the value 0 when there is no such child in the household.
Now I have to calculate the taxed income separately for the male and the female of each household based on some criteria, so I am trying to create one function that calculates the two numbers, then apply this function to my data frame and return a list with these numbers.
Specifically, I want something like this:
fact<-function(im,iw,r,am,af,a1c,a2c,a3c,a4c,a5c,a6c,a7c){
if ((am>0)&&(am<67)&&(af>0)) {mti<-im-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>0)&&(am<67)&&(af==0)) {mti<-im-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>=67)&&(af>0)) {mti<-im-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am<=67)&&(af==0)) {mti<-im-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am>0)) {fti<-iw-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am==0)) {fti<-iw-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>=67)&&(am>0)) {fti<-iw-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af<=67)&&(am==0)) {fti<-iw-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
return(mti,fti)}
How can I fix this function in order to apply it to my data frame?
Can a function return 2 values?
How can I apply the function?
Then I tried this:
fact<-function(im=data$im,iw=data$iw,r=data$r,am=data$am,af=data$af,a1c=data$a1c,a2c=data$a2c,a3c=data$a3c,a4c=data$a4c,a5c=data$a5c,a6c=data$a6c,a7c=data$a7c){
if ((am>0)&&(am<67)&&(af>0)) {mti<-im-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>0)&&(am<67)&&(af==0)) {mti<-im-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am>=67)&&(af>0)) {mti<-im-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((am<=67)&&(af==0)) {mti<-im-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am>0)) {fti<-iw-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>0)&&(af<67)&&(am==0)) {fti<-iw-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af>=67)&&(am>0)) {fti<-iw-1000-(r)/2-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
if ((af<=67)&&(am==0)) {fti<-iw-1000-r-(500*(a1c+a2c+a3c+a4c+a5c+a5c+a6c+a7c))}
return(mti,fti)}
fact(data[1,])
but I get this error: Error in fact(data[1, ]) : object 'mti' not found
When I tried the function for "fti" only, it runs, but gives wrong results.
Besides the need to return multiple values using c(mti, fti), your function doesn't have a default value if none of the conditions in the function is TRUE, so mti is never created.
Add mti <- NA (and likewise fti <- NA) at the start of your function, so that NA is the default value.
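The two points above (NA defaults, and returning both numbers with c()) could be combined roughly as follows. This is only a sketch: taxed_income is a hypothetical name, the rules are one reading of the conditions in the question (rent is split when both partners are present, 1000 is deducted from age 67, and each child counts once), and the two-row data frame is abridged from head(data).

```r
# Hypothetical rewrite; the tax rules are an interpretation of the question
taxed_income <- function(im, iw, r, am, af, children) {
  mti <- NA_real_                 # defaults, used when a partner is absent
  fti <- NA_real_
  deduction <- 500 * children
  if (am > 0) {
    rent_share <- if (af > 0) r / 2 else r
    senior <- if (am >= 67) 1000 else 0
    mti <- im - senior - rent_share - deduction
  }
  if (af > 0) {
    rent_share <- if (am > 0) r / 2 else r
    senior <- if (af >= 67) 1000 else 0
    fti <- iw - senior - rent_share - deduction
  }
  c(mti = mti, fti = fti)         # one named vector carries both values
}

# First two households from head(data), with only two child columns kept
data <- data.frame(
  im = c(0, 15395.61), iw = c(20064.970, 7397.191), r = c(5984.282, 0),
  am = c(0, 42), af = c(38, 30), a1c = c(0, 1), a2c = c(0, 0)
)
child_cols <- grep("^a[0-9]c$", names(data), value = TRUE)

# mapply calls the function once per row; t() gives one row per household
res <- t(mapply(taxed_income, data$im, data$iw, data$r,
                data$am, data$af, rowSums(data[child_cols])))
print(res)
```

Calling the function row by row with mapply() keeps the scalar if () tests valid, which they would not be if whole columns were passed in at once.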

How to fix rows order with pheatmap?

I have generated a heatmap with pheatmap and, for some reasons, I want the rows to appear in a predefined order.
I saw in previous posts that the solution is to set the parameter cluster_rows to FALSE and to arrange the rows of the matrix in the desired order, like this in my case:
Otu0085 Otu0086 Otu0087 Otu0088 Otu0091
AB200 0 0 0 0 0
2 91 0 2 1 0
20CF360 0 1 0 1 0
19CF359 0 0 0 2 0
11VP12 0 0 0 0 155
11VP04 4 1 0 0 345
However, when I do:
pheatmap(shared,cluster_rows = F)
My rows are sorted alphabetically, like this:
10CF278a
11
11AA07
11CF278b
11VP03
11VP04
11VP05
11VP06
11VP08
11VP09
Any suggestions would be welcome.
Thanks in advance.
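The reordering step described in the question can be sketched in base R. The matrix below reuses the values shown above; the names desired_order, scrambled and fixed are made up for illustration. One possible cause of the alphabetical output (an assumption, not something the post confirms) is that the matrix passed to pheatmap is itself already sorted by row name, in which case indexing it by the desired order before plotting should help.

```r
shared <- matrix(
  c( 0, 0, 0, 0,   0,
    91, 0, 2, 1,   0,
     0, 1, 0, 1,   0,
     0, 0, 0, 2,   0,
     0, 0, 0, 0, 155,
     4, 1, 0, 0, 345),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("AB200", "2", "20CF360", "19CF359", "11VP12", "11VP04"),
    c("Otu0085", "Otu0086", "Otu0087", "Otu0088", "Otu0091")
  )
)

# The predefined row order, as a character vector of row names
desired_order <- c("AB200", "2", "20CF360", "19CF359", "11VP12", "11VP04")

# Simulate a matrix that has ended up in alphabetical order
scrambled <- shared[order(rownames(shared)), ]

# Index by name to restore the predefined order before plotting
fixed <- scrambled[desired_order, , drop = FALSE]
print(rownames(fixed))

# With the pheatmap package installed, the plot call would then be:
# pheatmap::pheatmap(fixed, cluster_rows = FALSE)
```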

Rank the dataframe values in R

I want to fetch the row numbers of the 5 highest values in a column of a data frame, set the value 100 at the same row numbers in another data frame, and set all other values there to 0.
I know how to sort / order a column in a dataframe using order() function.
df=data.frame(a=c(345,948,290,189,3848,302,384,456,383,201,35,346,1.46,4.66,3,5,63,43,6432,4336,345,354,1245,342,523,743,248,932.5))
For example, df[order(-df$a),] results in
6432.00 4336.00 3848.00 1245.00 948.00 932.50 743.00 523.00 456.00 384.00 383.00 354.00 346.00 345.00 345.00 342.00 302.00 290.00 248.00 201.00 189.00 63.00 43.00 35.00 5.00 4.66 3.00 1.46
However, I am not able to meet my specific requirement.
I would expect to see df1 as
0 100 0 0 100 0 0 0 0 0 0 0 0 0 0 0 0 0 100 100 0 0 100 0 0 0 0 0
df$b <- ifelse(df$a %in% sort(df$a, decreasing = TRUE)[1:5], 100, 0)
We could use the rank function:
df$b <- (rank(-df$a) <= 5) * 100
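Since the question asks for the row numbers themselves and for a second data frame, a variant using order() may also be useful. This is a sketch; top_rows and the single-column df1 are names chosen here for illustration.

```r
df <- data.frame(a = c(345, 948, 290, 189, 3848, 302, 384, 456, 383, 201, 35,
                       346, 1.46, 4.66, 3, 5, 63, 43, 6432, 4336, 345, 354,
                       1245, 342, 523, 743, 248, 932.5))

# Row numbers of the five largest values, largest first
top_rows <- order(df$a, decreasing = TRUE)[1:5]

# Second data frame of the same height: 100 at those rows, 0 elsewhere
df1 <- data.frame(b = numeric(nrow(df)))
df1$b[top_rows] <- 100

print(top_rows)
print(df1$b)
```

Unlike the %in% version, this always marks exactly five rows, even when the fifth-largest value is tied with another one.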

Combining matrix of daily rows into weekly rows

I have a matrix with dates as row names and TAG#'s as column names. The matrix is populated with 0's and 1's for presence/absence.
eg
29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0
I have the following script for calculating site fidelity (% days present):
##Presence/absence data setup
##import file
read.csv('pn.csv')->'pn'
##strip out desired columns
pn[,c(5,7:9)]->pn
##create table of dates and tags
table(pn$Date,pn$Tag)->T
##convert to a matrix
as.matrix(T)->U
##convert to binary for presence/absence
1*(U>2)->U
##insert missing rows
library(micEcon)
insertRow(U,395,0)->U
rownames(U)[395]<-'2011-08-16'
insertRow(U,253,0)->U
rownames(U)[253]<-'2011-03-26'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-22'
insertRow(U,250,0)->U
rownames(U)[250]<-'2011-03-21'
##for presence/absence
##define i(tag or column)
1->i
##define place to store results
cbind(colnames(U),rep(NA,length(colnames(U))))->sfresult
##loop instructions
for(i in 1:ncol(U)){
##identify first detection day
grep(1,U[,i])[1]->tagrow
##count total days since first detection
nrow(U)-tagrow+1->days
##count days present
length(grep(1,U[,i]))->present
##calculate site fidelity
present/days->sfresult[i,2]
}
##change class of results column
as.numeric(sfresult[,2])->sfresult[,2]
##histogram
bins<-c(0,.3,.6,1)
xlab<-c('Low','Med','High')
hist(as.numeric(sfresult[,2]), breaks=bins,xaxt='n', col=heat.colors(3), xlab='Percent Days Present',ylab='Frequency (# of individuals)',main='Site Fidelity',freq=TRUE,labels=xlab)
axis(1,at=bins)
I'd like to calculate site fidelity on a weekly basis. I believe it would be easiest to simply collapse the matrix by combining every seven rows into a weekly matrix that simply sums the 0's and 1's from the daily matrix. Then the same script for site fidelity would calculate it on a weekly basis. Problem is I'm a newbie and I've had trouble finding an answer on how to collapse the daily matrix to a weekly matrix. Thanks for any suggestions.
Something like this should work:
x <- matrix(rbinom(1000,1,.2), nrow=50, ncol=20)
rownames(x) <- 1:50
colnames(x) <- paste0("id", 1:20)
library(data.table)
xdt <- as.data.table(x)
##assuming rows are sorted by date, that there are no missing days, and that the first row is the start of the week
###xdt[, week:=sort(rep(1:7, length.out=nrow(xdt)))] ##wrong
##length.out trims the last, incomplete week so the vector matches nrow(xdt)
xdt[, week:=rep(1:ceiling(nrow(xdt)/7), each=7, length.out=nrow(xdt))] ##fixed
xdt[, lapply(.SD,sum), by="week",.SDcols=setdiff(names(xdt),"week")]
I can help you preserve the rownames better if you provide a reproducible example (see "How to make a great R reproducible example?").
Edit:
Also, it's very atypical to use the right assignment -> as you do above.
R's cut function will trim Dates to their week (see ?cut.Date for more details). After that, it's a simple call to aggregate to get the result you need. Note that cut.Date takes a start.on.monday option.
Data
sites <- read.table(text="29735 29736 29737 29738 29739 29740
2010-07-15 1 0 0 0 0 0
2010-07-16 1 1 0 0 0 0
2010-07-17 1 1 0 0 0 0
2010-07-18 1 1 0 0 0 0
2010-07-19 1 1 0 0 0 0
2010-07-20 1 1 0 0 0 0",
header=TRUE, check.names=FALSE, row.names=1)
Answer
weeks.factor <- cut(as.Date(row.names(sites)),
breaks='weeks', start.on.monday=FALSE)
aggregate(sites, by=list(weeks.factor), FUN=function(col) sum(col)/length(col))
# Group.1 29735 29736 29737 29738 29739 29740
# 1 2010-07-11 1 0.6666667 0 0 0 0
# 2 2010-07-18 1 1.0000000 0 0 0 0
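If weekly sums are wanted, as the question originally asked, rather than the proportions computed above, the same cut() grouping can be fed to base R's rowsum(), which also keeps the result as a matrix. This is a sketch on a two-tag subset of the sample data; daily, week_start and weekly are names chosen here.

```r
# Two tags from the sample, six consecutive days
daily <- matrix(
  c(1, 0,
    1, 1,
    1, 1,
    1, 1,
    1, 1,
    1, 1),
  ncol = 2, byrow = TRUE,
  dimnames = list(format(as.Date("2010-07-15") + 0:5), c("29735", "29736"))
)

# Label each day with the Sunday that starts its week
week_start <- cut(as.Date(rownames(daily)), breaks = "week",
                  start.on.monday = FALSE)

# Sum the rows within each week; row names become the week start dates
weekly <- rowsum(daily, group = as.character(week_start))
print(weekly)
```

The weekly matrix holds counts of days present per week, so the presence/absence step of the original script would need to be reapplied (for example 1 * (weekly > 0)) before reusing the site fidelity loop.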
