Creating a connection matrix from a data frame in R - r

I have some data in this form:
> agreers <- read.csv('agreers.csv')
> attach(agreers)
> head(agreers)
wain1 wain2 count
1 Founder36 Mnist10_269 673
2 Founder3 Mnist10_19 665
3 Mnist10_140 Mnist10_257 663
4 Founder1 Founder15 659
5 Founder21 Founder25 654
6 Founder15 Founder32 654
I created the data such that wain1 <= wain2, so each pair appears in the table only once. So this would be an undirected graph.
I want to create a connection matrix, like so:
Mnist10_269 Mnist10_19 Mnist10_257 . . .
Founder36 673 ? ?
Founder3 ? 665 ?
Mnist10_140 ? ? 663
. . .
where the ?'s will be zero if there isn't any data in agreers. So here's what I've tried:
> mat = matrix(0, nrow = length(unique(wain1)), ncol = length(unique(wain2)))
> rownames(mat) = unique(wain1)
> colnames(mat) = unique(wain2)
> for(i in as.integer(rownames(agreers))) mat[wain1[i], wain2[i]] = count[i]
It does something, i.e., mat gets updated with numbers, but the numbers aren't in the right place! For example, I would expect this to return 673.
> mat["Founder36","Mnist10_269"]
[1] 0
EDIT: Here's a bit more of the data file, to show the "duplicated levels in factors" problem. Note that Mnist10_140 appears twice in the first column, but with different values in the second column.
wain1,wain2,count
Founder36,Mnist10_269,673
Founder3,Mnist10_19,665
Mnist10_140,Mnist10_257,663
Founder1,Founder15,659
Founder21,Founder25,654
Founder15,Founder32,654
Mnist10_140,Mnist10_84,643
When processing just that subset of the data, I get warnings:
> agreers <- read.csv('temp.csv')
> connections <- xtabs(count ~ factor(wain1, levels = wain1) + factor(wain2, levels = wain2), agreers)
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated

If you like base R you can use table
df <- read.table(header=TRUE, text=' wain1 wain2 count
Founder36 Mnist10_269 673
Founder3 Mnist10_19 665
Mnist10_140 Mnist10_257 663
Founder1 Founder15 659
Founder21 Founder25 654
Founder15 Founder32 654')
tab <- with(df,table(factor(wain1, levels=unique(wain1)),
factor(wain2, levels=unique(wain2))))
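# Fill cells by (row name, column name) pairs so each count lands in its own cell
tab[cbind(as.character(df$wain1), as.character(df$wain2))] <- df$count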
tab
Mnist10_269 Mnist10_19 Mnist10_257 Founder15 Founder25 Founder32
Founder36 673 0 0 0 0 0
Founder3 0 665 0 0 0 0
Mnist10_140 0 0 663 0 0 0
Founder1 0 0 0 659 0 0
Founder21 0 0 0 0 654 0
Founder15 0 0 0 0 0 654
EDIT
As @DavidArenburg suggests, you can also use xtabs
xtabs(count ~ factor(wain1, levels = unique(wain1)) + factor(wain2, levels = unique(wain2)), df)
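The warning in the question's EDIT comes from passing the raw columns as levels, which repeats Mnist10_140. A minimal sketch on the seven-row sample from the question, assuming the file is read with the column names shown there, suggests that levels = unique(...) avoids the duplicated levels while keeping first-appearance order:

```r
# The seven rows from the question's EDIT, read inline instead of from temp.csv
agreers <- read.csv(text = "wain1,wain2,count
Founder36,Mnist10_269,673
Founder3,Mnist10_19,665
Mnist10_140,Mnist10_257,663
Founder1,Founder15,659
Founder21,Founder25,654
Founder15,Founder32,654
Mnist10_140,Mnist10_84,643", stringsAsFactors = FALSE)

# unique() removes the repeated Mnist10_140 level but keeps the original order
connections <- xtabs(
  count ~ factor(wain1, levels = unique(wain1)) +
          factor(wain2, levels = unique(wain2)),
  data = agreers
)

connections["Founder36", "Mnist10_269"]   # 673
connections["Mnist10_140", "Mnist10_84"]  # 643
```

Pairs that never occur in the data stay at zero, which matches the layout requested in the question.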

Have a look at the package reshape2
library(reshape2)
agreers <- read.table(header = TRUE, stringsAsFactors = FALSE, sep = ',', text = "wain1,wain2,count\nFounder36,Mnist10_269,673\nFounder3,Mnist10_19,665\nMnist10_140,Mnist10_257,663\nFounder1,Founder15,659\nFounder21,Founder25,654\nFounder15,Founder32,654\n")
conMat <- dcast(agreers, wain1 ~ wain2, value.var = "count", fill = 0)
rownames(conMat) <- conMat$wain1
conMat$wain1 <- NULL
conMat["Founder36","Mnist10_269"]
That should solve the problem.
EDIT
This does not result in sorted data. Have a look at @cdeterman's solution instead.

Here is a variation of @cdeterman's approach (df from the same post)
do.call(table, lapply(df[1:2], function(x)
factor(x, levels=unique(x))))*df[,3]
# wain2
# wain1 Mnist10_269 Mnist10_19 Mnist10_257 Founder15 Founder25 Founder32
# Founder36 673 0 0 0 0 0
# Founder3 0 665 0 0 0 0
# Mnist10_140 0 0 663 0 0 0
# Founder1 0 0 0 659 0 0
# Founder21 0 0 0 0 654 0
# Founder15 0 0 0 0 0 654
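As for why the loop in the question put numbers in the wrong cells: a likely cause, assuming wain1 and wain2 were read as factors (the read.csv default before R 4.0), is that indexing with a factor uses its integer codes rather than its labels. A small sketch with made-up names shows the effect and the as.character() fix:

```r
# Hypothetical two-level factor; levels are sorted, so "b" has code 2
wain1 <- factor(c("b", "a"))

mat <- matrix(0, nrow = 2, ncol = 1, dimnames = list(c("b", "a"), "x"))

# Indexing by the factor uses code 2, i.e. the second row, which is named "a"
mat[wain1[1], "x"] <- 5
wrong <- mat["b", "x"]   # still 0, the value went into row "a"

# Converting to character indexes by name instead
mat2 <- matrix(0, nrow = 2, ncol = 1, dimnames = list(c("b", "a"), "x"))
mat2[as.character(wain1[1]), "x"] <- 5
right <- mat2["b", "x"]  # 5
```

With as.character(wain1[i]) and as.character(wain2[i]) in the loop, the original approach should fill the expected cells.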

Related

UpSetR error when using queries: replacement has 1 row, data has 0

I have tried to use UpSetR to visualize the input file which can be found here
> library("UpSetR")
> orthogroups_df<- read.table("orthogroups.GeneCount.tsv", header=T, stringsAsFactors = F)
> #All species
> selected_species <- colnames(orthogroups_df)[2:(ncol(orthogroups_df) -1)]
> selected_species
[1] "Atha" "Cann" "NQLD" "Natt" "Ngla" "Nlab" "Nsyl" "Ntab" "Ntom" "Slyc" "Stub" "Vvin"
> head(orthogroups_df)
Orthogroup Atha Cann NQLD Natt Ngla Nlab Nsyl Ntab Ntom Slyc Stub Vvin Total
1 OG0000000 0 0 965 0 0 3 0 0 0 0 0 0 968
2 OG0000001 0 1 3 0 0 448 0 0 0 0 0 0 452
3 OG0000002 0 1 313 0 0 120 1 0 1 0 0 0 436
4 OG0000003 0 93 15 21 46 16 33 63 36 25 39 26 413
5 OG0000004 1 42 2 34 109 6 8 154 11 9 4 0 380
6 OG0000005 0 2 61 1 34 44 91 70 43 20 1 0 367
> ncol(orthogroups_df)
[1] 14
> orthogroups_df[orthogroups_df > 0] <- 1
> upset(orthogroups_df,
+ nsets = ncol(orthogroups_df),
+ sets = rev(c(selected_species)),
+ queries = list(list(query = intersects, params = list("NQLD", "Nlab", "Nsyl"), color = "#238c45", active = T),
+ list(query = intersects, params = list("NQLD", "Nlab"), color = "#ffd977", active = T)))
Error in `$<-.data.frame`(`*tmp*`, "freq", value = 45L) :
replacement has 1 row, data has 0
How is it possible to fix the above error?
The number of intersections shown, nintersects, needs to be set high enough that the sets named in the query params are actually displayed.
By default nintersects is 40, and list("NQLD", "Nlab", "Nsyl") only appears at the 90th position, so a larger value is needed. Here I tried 90:
upset(orthogroups_df,
nsets = ncol(orthogroups_df),
sets = rev(c(selected_species)),
nintersects = 90,
queries = list(
list(query = intersects,
params = list("NQLD", "Nlab", "Nsyl"),
color = "red",
active = TRUE),
list(query = intersects,
params = list("NQLD", "Nlab"),
color = "blue",
active = TRUE)))
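To estimate how large nintersects has to be, one can rank the exclusive intersections by size in base R. This is only a sketch on toy data: the columns A, B, C stand in for the species columns, rank_of is a hypothetical helper, and it assumes intersections are ranked by frequency, so tie-breaking may differ from what UpSetR draws.

```r
# Toy presence/absence table (hypothetical stand-in for orthogroups_df)
toy <- data.frame(A = c(1, 1, 1, 0, 0, 1),
                  B = c(1, 1, 0, 1, 0, 1),
                  C = c(0, 0, 0, 1, 1, 1))

# Label each row with the exact combination of sets it belongs to
pattern <- apply(toy == 1, 1, function(r) paste(names(toy)[r], collapse = "&"))

# Size of each exclusive intersection, largest first
freq <- sort(table(pattern), decreasing = TRUE)

# Position of a given combination in that ranking
# (set names must be given in column order)
rank_of <- function(sets) match(paste(sets, collapse = "&"), names(freq))

rank_of(c("A", "B"))
```

If the returned rank is larger than nintersects, the queried intersection is not among the bars drawn, which is consistent with the error above.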

Nested xtab tables

I would like to produce nested tables for a multilevel factorial experiment. I have 10 paints examined for time to reach an end point under 4 levels of humidity, 3 temperatures and 2 wind speeds. Of course I have searched online, but without success.
Some sample code can be generated using:
## Made Up Data # NB the data is continuous whereas observations were made 40/168 so data is censored.
time3 <- 4 * seq_len(24) # Dependent: times in hrs, a regular sequence is not really representative but will do
wind <- c(1,2) # Independent: factor draught on or off
RH <- c(0,35,75,95) # Independent: value for RH but can be processed as a factor
temp <- c(5,11,20) # Independent: value for temperature but can be processed as a factor
paint <- c("paintA", "paintB", "paintC") # Independent: Experimental material
# Combine into dataframe
dfa <- data.frame(rep(temp,8))
dfa$RH <- rep(RH,6)
dfa$wind <- rep(wind,12)
dfa$time3 <- time3
dfa$paint <- rep(paint[1],24)
# Replicate for different paints
dfb <- dfa
dfb$paint <- paint[2]
dfc <- dfa
dfc$paint <- paint[3]
dfx <- do.call("rbind", list(dfa,dfb,dfc))
# Rename first col
colnames(dfx)[1] <- "temp"
# Prepare xtab tables
tx <- xtabs(dfx$time3 ~ dfx$wind + dfx$RH + dfx$temp + dfx$paint)
tx
And the target I hope to obtain would be like this xtab example
This
tx <- xtabs(dfx$time3 ~ dfx$wind + dfx$RH + dfx$temp)
does not work well enough. I would also like to write to C:\file.csv for printing and reporting etc. Please advise on how to achieve the desired output.
You can paste the two variables you want to nest together. Since the items will be ordered lexicographically, you will need to zero-pad the temp variable, to get numerical ordering.
xtabs(time3~wind+paste(sprintf("%02d",temp),RH,sep=":")+paint,dfx)
, , paint = paintA
paste(sprintf("%02d", temp), RH, sep = ":")
wind 05:0 05:35 05:75 05:95 11:0 11:35 11:75 11:95 20:0 20:35 20:75 20:95
1 56 0 104 0 88 0 136 0 120 0 72 0
2 0 128 0 80 0 64 0 112 0 96 0 144
, , paint = paintB
paste(sprintf("%02d", temp), RH, sep = ":")
wind 05:0 05:35 05:75 05:95 11:0 11:35 11:75 11:95 20:0 20:35 20:75 20:95
1 56 0 104 0 88 0 136 0 120 0 72 0
2 0 128 0 80 0 64 0 112 0 96 0 144
, , paint = paintC
paste(sprintf("%02d", temp), RH, sep = ":")
wind 05:0 05:35 05:75 05:95 11:0 11:35 11:75 11:95 20:0 20:35 20:75 20:95
1 56 0 104 0 88 0 136 0 120 0 72 0
2 0 128 0 80 0 64 0 112 0 96 0 144
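Another option worth trying is base R's ftable(), which flattens a multi-way table into nested row and column headers without pasting labels together. The sketch below rebuilds the sample data from the question; the choice of row and column variables is just one possible layout, and tempfile() stands in for the real output path.

```r
# Rebuild the sample data from the question
dfa <- data.frame(temp  = rep(c(5, 11, 20), 8),
                  RH    = rep(c(0, 35, 75, 95), 6),
                  wind  = rep(c(1, 2), 12),
                  time3 = 4 * seq_len(24))
dfx <- do.call(rbind, lapply(c("paintA", "paintB", "paintC"),
                             function(p) cbind(dfa, paint = p)))

# Four-way table of summed times
tx <- xtabs(time3 ~ wind + RH + temp + paint, data = dfx)

# Nest wind within paint on the rows and RH within temp on the columns
ft <- ftable(tx, row.vars = c("paint", "wind"), col.vars = c("temp", "RH"))
ft

# Write the formatted table, headers included, as comma-separated text
out <- tempfile(fileext = ".csv")  # replace with the desired file path
write.table(format(ft, quote = FALSE), out, sep = ",",
            row.names = FALSE, col.names = FALSE)
```

Because temp stays numeric inside xtabs, its levels sort numerically and no zero-padding is needed here.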

r if else for loop with logical and boolean operators

I'm trying to use an if else statement to create a new column of binary data in my data frame, but what I get is all zeros...
command:
for(i in 1:nrow(asort)){
if(asort$recip==0 && asort$dist<.74){
asort$temp[i]<-0
} else{
asort$temp[i]<-1
}
}
#temp ends up being all 0's
In addition, I would actually like to ask something along the lines of this:
# If recip is 0 and the distance is < 0.74, OR if the distance is greater than 1.85,
# give me a zero, else 1
for(i in 1:nrow(asort)){
if(asort$recip==0 && asort$dist<.74 || asort$dist>1.85){
asort$temp[i]<-0
} else{
asort$temp[i]<-1
}
}
> head(asort)
coordinates CLASS_ID Flight UFID dist nnid nnid2 observed recip temp
157 (285293.3, 4426017) 0 F4_ F4_156 0.3857936 158 F4_157 0 0 0
158 (285293.2, 4426017) 0 F4_ F4_157 0.3857936 157 F4_156 0 0 0
259 (285255, 4426014) 0 F4_ F4_258 0.5226039 261 F4_260 1 0 0
261 (285255, 4426014) 0 F4_ F4_260 0.5226039 259 F4_258 1 0 0
402 (285275.3, 4426004) 0 F4_ F4_401 0.5598427 403 F4_402 1 0 0
403 (285275.6, 4426004) 0 F4_ F4_402 0.5598427 402 F4_401 1 0 0
Using df data.frame
dist <- runif(10, 0.3, 2)
recip<- c(0,1,1,0,1,0,1,0,0,1)
df <- data.frame(dist, recip)
and ifelse
df$temp<-ifelse(df$dist < 0.74 & df$recip == 0 , 0,
ifelse(df$dist > 1.85 & df$recip == 0, 0, 1))
> head(df)
# dist recip temp
#1 1.1878002 0 1
#2 0.4147835 1 1
#3 1.3707311 1 1
#4 0.9008034 0 1
#5 1.0220149 1 1
#6 1.9384069 0 0
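On why the original loop gave the same value everywhere: the condition tests the whole columns (asort$recip, asort$dist) instead of row i, and && only works on single values. Older versions of R silently used just the first element; recent versions raise an error. A sketch on made-up rows, reading the OR literally as in the question's comment (which is not quite the same rule as the nested ifelse above):

```r
# Made-up rows covering each case
asort <- data.frame(recip = c(0, 0, 1, 0),
                    dist  = c(0.5, 1.0, 0.5, 2.0))

# Loop version: index every column with [i] so each row is tested on its own
asort$temp <- NA
for (i in seq_len(nrow(asort))) {
  if ((asort$recip[i] == 0 && asort$dist[i] < 0.74) || asort$dist[i] > 1.85) {
    asort$temp[i] <- 0
  } else {
    asort$temp[i] <- 1
  }
}

# Vectorized version: & and | work element-wise, so no loop is needed
asort$temp2 <- ifelse((asort$recip == 0 & asort$dist < 0.74) | asort$dist > 1.85,
                      0, 1)
asort
```

Both columns should agree row by row; the vectorized form is usually the more idiomatic choice.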

Product between two data.frames columns

I have two data.frames:
The first one is the coefficients of my regressions for each day:
> parametrosBase
beta0 beta1 beta4
2015-12-15 0.1622824 -0.012956819 -0.04637442
2015-12-16 0.1641884 -0.007914548 -0.06170213
2015-12-17 0.1623660 -0.005618474 -0.05914809
2015-12-18 0.1643263 0.005380472 -0.08533237
2015-12-21 0.1667710 0.003824588 -0.09040071
The second one is: the independent (x) variables:
> head(ir_dfSTORED)
ind m h0x h1x h4x beta0_h0x beta1_h1x beta4_h4x
1 2015-12-15 21 1 0.5642792 0.2859359 0 0 0
2 2015-12-15 42 1 0.3606713 0.2831963 0 0 0
3 2015-12-15 63 1 0.2550200 0.2334554 0 0 0
4 2015-12-15 84 1 0.1943071 0.1883048 0 0 0
5 2015-12-15 105 1 0.1561231 0.1544524 0 0 0
6 2015-12-15 126 1 0.1302597 0.1297947 0 0 0
> tail(ir_dfSTORED)
ind m h0x h1x h4x beta0_h0x beta1_h1x beta4_h4x
835 2015-12-21 2415 1 0.006799321 0.006799321 0 0 0
836 2015-12-21 2436 1 0.006740707 0.006740707 0 0 0
837 2015-12-21 2457 1 0.006683094 0.006683094 0 0 0
838 2015-12-21 2478 1 0.006626457 0.006626457 0 0 0
839 2015-12-21 2499 1 0.006570773 0.006570773 0 0 0
840 2015-12-21 2520 1 0.006516016 0.006516016 0 0 0
What I want is to multiply the beta0 column of "parametrosBase" by the h0x column of "ir_dfSTORED" and store the result in the beta0_h0x column, and likewise for beta1 and beta4.
The problem I'm facing is with the dates in the "ind" column. The multiplication has to respect the dates.
So whenever the day changes in "ir_dfSTORED", I have to switch to the same day in "parametrosBase".
For example:
The first row of the "parametrosBase" df,
2015-12-15 0.1622824 -0.012956819 -0.04637442
is fixed for the day 2015-12-15, and I do the product with it. Once I reach 2015-12-16, I have to use the second row of "parametrosBase".
How can I do this?
Thanks a lot. :)
Maybe you should merge the two datasets first:
parametrosBase$ind <- rownames(parametrosBase)
df <- merge(ir_dfSTORED,parametrosBase)
df <- within(df,{
beta0_h0x <- beta0*h0x
beta1_h1x <- beta1*h1x
beta4_h4x <- beta4*h4x
})
Since I don't know the structure of the data, you may have to convert the dates from rownames to a date format in order for the merge to work. Using ind as the name of the date in parametrosBase is key to making merge work, otherwise you'll have to specify the variables to merge by.
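If the original row order of ir_dfSTORED matters (merge may reorder rows), a lookup with match() is an alternative. This is a sketch on made-up numbers: ir is a hypothetical stand-in for ir_dfSTORED, and it assumes the dates in ind and the row names of parametrosBase use the same text format.

```r
# Made-up coefficients, one row per day, dates as row names
parametrosBase <- data.frame(beta0 = c(0.16, 0.17),
                             beta1 = c(-0.01, 0.02),
                             row.names = c("2015-12-15", "2015-12-16"))

# Made-up regressors, several rows per day
ir <- data.frame(ind = c("2015-12-15", "2015-12-15", "2015-12-16"),
                 h0x = 1,
                 h1x = c(0.5, 0.3, 0.2))

# For every row of ir, find the matching day in parametrosBase
idx <- match(as.character(ir$ind), rownames(parametrosBase))

# Multiply each regressor by that day's coefficient
ir$beta0_h0x <- parametrosBase$beta0[idx] * ir$h0x
ir$beta1_h1x <- parametrosBase$beta1[idx] * ir$h1x
ir
```

Days missing from parametrosBase give NA in idx, so the products for those rows come out as NA rather than being silently dropped.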

Using a column entry as a "selector" for datasets in R

My array looks like this:
Slide Index A B C DoseGroup
482 778 l 0 0 2 13Gy_p_75_42wk
483 778 r 0 0 2 13Gy_p_75_42wk
484 779 l 0 0 2 13Gy_p_75_42wk
485 779 r 0 0 2 13Gy_p_75_42wk
486 4700 l 2 2 2 14.25Gy_C_50pl_42wk
487 4700 r 0 0 1 14.25Gy_C_50pl_42wk
488 4701 l 0 0 1 14.25Gy_C_50pl_42wk
I would like to use the DoseGroup column's entries to be able to select the respective entries in the other columns. I would like to be able to tell R, e.g., "Do a wilcox.test between the 13Gy_p_75_42wk and the 14.25Gy_C_50pl_42wk datasets using column C."
How can I do this with R? Is there some kind of way to select all columns having the entry 14.25Gy_C_50pl_42wk?
I modified your data to add a third level in DoseGroup to make it more realistic.
txt <- "Slide Index A B C DoseGroup
778 l 0 0 2 13Gy_p_75_42wk
778 r 0 0 2 13Gy_p_75_42wk
779 l 0 0 2 13Gy_p_75_42wk
779 r 0 0 2 13Gy_p_75_42wk
4700 l 2 2 2 14.25Gy_C_50pl_42wk
4700 r 0 0 1 14.25Gy_C_50pl_42wk
4701 l 0 0 1 14.25Gy_C_50pl_42wk
4702 l 0 0 10 15Gy_C_50pl_42wk"
dat <- read.table(text = txt, header = TRUE)
wilcox.test(C ~ DoseGroup, data = dat,
subset = DoseGroup %in% c("13Gy_p_75_42wk", "14.25Gy_C_50pl_42wk"))
## Wilcoxon rank sum test with continuity correction
## data: C by DoseGroup
## W = 10, p-value = 0.1175
## alternative hypothesis: true location shift is not equal to 0
To select data, you can use one of these two commands.
dat[dat$DoseGroup == "14.25Gy_C_50pl_42wk", ]
subset(dat, DoseGroup == "14.25Gy_C_50pl_42wk")
Those commands are basic R, and any introduction to R will teach you how to do the same.
So I urge you to read one if you really want to enjoy R.
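The "selector" idea can also be written with split(), which turns one column into a named list keyed by DoseGroup. A minimal sketch using column C of the seven rows from the question; exact = FALSE is set here only because the data contain ties:

```r
# Column C and DoseGroup from the question's seven rows
dat <- data.frame(
  C = c(2, 2, 2, 2, 2, 1, 1),
  DoseGroup = c(rep("13Gy_p_75_42wk", 4), rep("14.25Gy_C_50pl_42wk", 3))
)

# One vector of C values per dose group, addressable by group name
groups <- split(dat$C, dat$DoseGroup)

# Two-sample test between the selected groups
res <- wilcox.test(groups[["13Gy_p_75_42wk"]],
                   groups[["14.25Gy_C_50pl_42wk"]],
                   exact = FALSE)
res$statistic
```

This is equivalent in spirit to the formula interface with subset shown above, but makes the two selected samples explicit.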
