R: filling a matrix with values does not work

I have a data frame vec that I need to prepare for an image.plot() plot. The structure of vec is as follows:
> str(vec)
'data.frame': 31212 obs. of 5 variables:
$ x : int 8 24 40 56 72 88 104 120 136 152 ...
$ y : int 8 8 8 8 8 8 8 8 8 8 ...
$ dx: num 0 0 0 0 0 0 0 0 0 0 ...
$ dy: num 0 0 0 0 0 0 0 0 0 0 ...
$ d : num 0 0 0 0 0 0 0 0 0 0 ...
Note: the values in $dx, $dy and $d are not zero but only too small to be shown in this overview.
Background: the data is the output of a pixel tracking software. $x and $y are pixel coordinates while in $d are the displacement vector lengths (in pixels) of that pixel.
image.plot() expects, as its first and second arguments, the grid coordinates as sorted vectors, so sort(unique(vec$x)) and sort(unique(vec$y)) respectively should be good. So, I would like to end up with image.plot(sort(unique(vec$x)), sort(unique(vec$y)), data)
The third argument is the actual data. To build this I tried:
# spanning an empty matrix
data = matrix(NA,length(unique(vec$x)),length(unique(vec$y)))
# filling the matrix
data[match(vec$x, sort(unique(vec$x))), match(vec$y, sort(unique(vec$y)))] = vec$d
But, unfortunately, this isn't working. It reports no errors but data contains no values! This works:
for(i in c(1:length(vec$x))) data[match(vec$x[i], sort(unique(vec$x))), match(vec$y[i], sort(unique(vec$y)))] = vec$d[i]
But it is very slow.
a) is there a better way to build data?
b) is there a better way to deal with my problem, anyways?

R allows indexing of a matrix by a two-column matrix, where the first column of the index is interpreted as the row index and the second column as the column index. (Indexing with two separate vectors, as in your attempt, instead selects the whole sub-grid of those rows and columns and recycles vec$d across it, rather than matching the x/y pairs element-wise.) So create the index into data as a two-column matrix
idx = cbind(match(vec$x, sort(unique(vec$x))),
            match(vec$y, sort(unique(vec$y))))
and use that
data[idx] = vec$d
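Putting the pieces together, a minimal sketch (it assumes image.plot() comes from the fields package, since it is not in base R):
library(fields)                        # provides image.plot()
xs <- sort(unique(vec$x))
ys <- sort(unique(vec$y))
data <- matrix(NA, length(xs), length(ys))
idx  <- cbind(match(vec$x, xs), match(vec$y, ys))
data[idx] <- vec$d                     # vectorised fill via two-column matrix indexing
image.plot(xs, ys, data)
Storing sort(unique(...)) once in xs and ys also avoids recomputing it for every call.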

Related

How to divide my dataset for using permanova

Hello everyone :) I have a data set with individuals belonging to 5 different species in one column, and their presence/absence in different landscape types (7 other columns).
'data.frame': 1212 obs. of 10 variables:
$ latitude : num 34.5 34.7 34.7 34.8 34.8 ...
$ longitude : num 127 127 127 127 127 ...
$ species : Factor w/ 5 levels "Bufo gargarizans",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Built : int 0 0 0 0 0 0 0 0 0 0 ...
$ Agriculture: int 1 1 0 1 0 0 1 0 0 0 ...
$ Forested : int 0 0 1 0 0 0 0 1 1 1 ...
$ Grassland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Wetland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Bare : int 0 0 0 0 1 0 0 0 0 0 ...
$ Water : int 1 0 0 0 0 1 0 0 0 0 ...
I am trying to use PERMANOVA and then a Tukey test to see whether the species use the landscape differently or not. My supervisor did it in SPSS and it worked very well, so I have to do it in R.
I saw that I need 2 CSV files to run PERMANOVA in R, but I only have one. Here is the script that I found on the internet and want to use for my analysis.
library(vegan)
data(dune)
data(dune.env)
# default test by terms
adonis2(dune ~ Management*A1, data = dune.env)
In my case, I should have 1 dataframe with species and 1 dataframe with environmental variables, if I understand well.
However my presence/absence are inside the environmental categories (see the str of my table above). So if I create 1 dataframe with species only, I will not have numerical values in the dataframe with species.
So I am totally lost. I don't know how to proceed. Can someone help me please? Thank you!
I will split my answer into two parts. The one where I know what I am talking about and one where I am brainstorming ;)
Here is the first part on how to split your data into two data.frames
# Set seed
set.seed(1312)
# Some sample data
sample_data <- dplyr::tibble(species     = sample(LETTERS[1:5], size=500, replace=T),
                             Built       = sample(c(0,1), size=500, replace=T),
                             Agriculture = sample(c(0,1), size=500, replace=T),
                             Forested    = sample(c(0,1), size=500, replace=T),
                             Grassland   = sample(c(0,1), size=500, replace=T),
                             Wetland     = sample(c(0,1), size=500, replace=T),
                             Bare        = sample(c(0,1), size=500, replace=T),
                             Water       = sample(c(0,1), size=500, replace=T))
# Split data
# %>% requires dplyr (or magrittr) to be attached: library(dplyr)
df1 <- sample_data %>%
  dplyr::select(species)        # species column only
df2 <- sample_data %>%
  dplyr::select(Built:Water)    # landscape / environmental columns
Now part two: as it says in ?adonis2, the first argument of adonis2 is a formula where the left-hand side must be a community data matrix or a dissimilarity matrix.
Even though I am not sure whether it makes sense, I went wild and followed the instructions :D
df2_dist <- dist(df2)
vegan::adonis2(df2_dist~species, data=df1)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
vegan::adonis2(formula = df2_dist ~ species, data = df1)
Df SumOfSqs R2 F Pr(>F)
species 4 5.333 0.15802 0.7038 0.88
Residual 15 28.417 0.84198
Total 19 33.750 1.00000
Of course this might be nonsense in terms of content as I took a purely technical approach here, but maybe it helps you to shape your data as required
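A small aside on the distance: for presence/absence data, a Jaccard dissimilarity is a common alternative to the Euclidean dist() used above. A hedged variant:
df2_dist <- vegan::vegdist(df2, method = "jaccard", binary = TRUE)   # Jaccard on presence/absence
vegan::adonis2(df2_dist ~ species, data = df1)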
So I wrote this code:
setwd("C:/Users/Johan/Downloads/memoire Johanna (1)/memoire Johanna")
xdata=read.csv(file="all10_reduce_focal species1.csv", header=T, sep=",")
str(xdata)
xdata$species <- as.factor(xdata$species)
sample_data <- dplyr::tibble(species     = sample(LETTERS[1:5], size=1212, replace=T),
                             Built       = sample(c(0,1), size=1212, replace=T),
                             Agriculture = sample(c(0,1), size=1212, replace=T),
                             Forested    = sample(c(0,1), size=1212, replace=T),
                             Grassland   = sample(c(0,1), size=1212, replace=T),
                             Wetland     = sample(c(0,1), size=1212, replace=T),
                             Bare        = sample(c(0,1), size=1212, replace=T),
                             Water       = sample(c(0,1), size=1212, replace=T))
library(dplyr)
df1 <- sample_data %>%
  dplyr::select(species)
df2 <- sample_data %>%
  dplyr::select(Built:Water)
df1_dist <- dist(df1)
vegan::adonis2(df1_dist ~ Built + Agriculture + Grassland + Forested + Wetland + Bare + Water, data = df2)
Species should be the response, as I am trying to see the effect of the landscape on the species. When I do this I get:
Error in vegdist(as.matrix(lhs), method = method, ...) : input data must be numeric
It's because the "species" variable contains only characters. So I changed it to make it numeric:
sample_data <- dplyr::tibble(species=sample(c(1:5),size=1212,replace=T),
Built=sample(c(0,1),size=1212,replace=T),......
But the result I got is different from the result in SPSS, as I don't have any significant variables (in SPSS, Built, Agriculture, Forested and Water are significant).
I think my code is wrong.
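For what it is worth, here is a hedged sketch of the orientation the answer above used, applied to the real xdata instead of the simulated sample_data (column names taken from the str(xdata) output; not tested against the actual file):
library(vegan)
land <- xdata[, c("Built", "Agriculture", "Forested", "Grassland",
                  "Wetland", "Bare", "Water")]     # landscape presence/absence as the community matrix
grp  <- data.frame(species = xdata$species)        # grouping factor
adonis2(dist(land) ~ species, data = grp)          # does landscape use differ between species?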

How to organize my data to create a heatmap using correlation or cluster analysis (x must be numeric problem)

I need some help with generating heatmaps with cluster analysis and correlation (I am new to R). My data looks like this in Excel:
Gene1 Gene2 Gene3 Gene4 Gene5 ... Gene296
Bacteria1 0 0 0 0.7 0.2 ... 0
Bacteria2 0.44 0 0 0 0 ... 0.9
Bacteria2 0 0.32 0 0.4 0 ... 0
... ... ... ... ... ... ... ...
Bacteria117 0 0.2 0.3 0 0.7 ... 0
A value of 0.32 represents a score of 32 on a scale from 0 to 100. There are higher scores (0.9, for example) and lower scores (0 or 0.2, for example). I checked for NAs and there are none. I want to do a cluster analysis to find out which bacteria form clusters according to my experimental data (scores). The file is a CSV. I used this code:
> aa <- read.csv(file.choose())
> str(aa)
#I obtain this structure
'data.frame': 117 obs. of 296 variables:
$ X : Factor w/ 117 levels "Ac_neuii_BVI",..: 45 64 67 104 1 2 3 4 5 6 ...
$ AAC6_Iad : num 0 0 0 0 0 0 0 0 0 0 ...
$ aad6 : num 0 0 0 0 0 0 0 0 0 0 ...
$ abeS : num 0 0 0 0 0 0 0 0 0 0 ...
> is.numeric(aa)
[1] FALSE
When I try to use the correlation or the clustering I get this error:
> az <- cor(aa)
Error in cor(aa) : 'x' must be numeric
I tried as.matrix but the error persists in the matrix, of course. I tried as.numeric but it didn't work. I removed X with aa$X <- NULL and the problem disappeared (I don't know if this is the correct way to solve it), but the bacteria names disappeared too, and then I get a correlation between my genes only, not between my genes AND the bacteria. The same thing happens with the clustering using hclust or dist. Is there a way I should organize my CSV file? I haven't found a clear article on the internet on how to solve the "'x' must be numeric" problem, or on how to compute the correlation or the distances between the genes and the bacteria.
Thank you. Sorry for the ignorance on certain things that might appear obvious to you.
You can import the bacteria names as row.names:
aa <- read.csv(file.choose(), row.names = 1)
aa$X is not numeric (it contains factors). You can transform it with:
aa$X = as.numeric(aa$X)
Then az <- cor(aa) will run... but (as noted by @Cole) it does not make sense, since X refers to the names of the bacteria.
You can set the first column to be the names of the rows with the row.names parameter of read.csv:
aa <- read.csv(file.choose(), row.names = 1)
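Putting the suggestions together, a minimal sketch (the file name is hypothetical; it assumes the first CSV column holds the bacteria names):
aa <- read.csv("scores.csv", row.names = 1)   # hypothetical file name; bacteria names become row names
gene_cor  <- cor(aa)                          # correlation between genes (columns)
bac_clust <- hclust(dist(aa))                 # hierarchical clustering of bacteria (rows)
plot(bac_clust)                               # bacteria names appear as leaf labels
heatmap(as.matrix(aa))                        # heatmap with row (bacteria) and column (gene) dendrograms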

Reorder a list of dataframes before rbind (R)

I'm working with R and I have a problem with rbinding data frames.
My data come from a JSON file, and the first thing I did was to split it according to chromosome number:
# Input (fromJSON() here is assumed to come from the rjson package)
Control <- fromJSON(file=O5)
RNAi <- fromJSON(file=s25p5)
# Loop through each chromosome
Control.1 <- lapply(Control, function(I) {
  data.frame(matrix(unlist(I), ncol = 1, byrow = TRUE))
})
The problem is that now I have a list of 6 data.frames, but in a random order:
str(Control.1)
List of 6
$ II :'data.frame': 1771887 obs. of 1 variable:
..$ matrix.unlist.I...ncol...1..byrow...TRUE.: num [1:1771887] 0 0 0 0 0 0 0 0 0 0 ...
$ I :'data.frame': 1507243 obs. of 1 variable:
..$ matrix.unlist.I...ncol...1..byrow...TRUE.: num [1:1507243] 0 0 0 0 0 0 0 0 0 0 ...
$ III :'data.frame': 1378370 obs. of 1 variable:
..$ matrix.unlist.I...ncol...1..byrow...TRUE.: num [1:1378370] 0 0 0 0 0 0 0 0 0 0 ...
etc.
I would like to reorder them so that $I is the first data.frame, then $II, and so on.
My aim is then to use rbind:
Control.2 <- do.call(rbind, Control.1)
in order to get one data frame containing all of them, in the correct order.
Does anybody have any idea how it could be done?
Thank you!
For alphabetical order you can use:
Control.2 <- do.call(rbind, Control.1[order(names(Control.1))])
or you can use any other function instead of order to sort the names vector.
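If the chromosome names ever include values where alphabetical and chromosomal order differ (for example an X or MtDNA entry), an explicit ordering is safer. A hedged sketch, assuming the six elements are named I to VI (adjust to the actual names):
chrom_order <- c("I", "II", "III", "IV", "V", "VI")   # assumed names; adjust as needed
Control.1   <- Control.1[chrom_order]                 # reorder the list explicitly
Control.2   <- do.call(rbind, Control.1)              # stack into one data frame, in order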

Conditional input using read.table or readLines

I'm struggling with using readLines() and read.table() to get a well-formatted data frame in R.
I want to read files like this, which are hockey stats. I'd like to get a nicely formatted data frame; however, specifying a fixed number of lines to read is difficult because in other files like this the number of players differs. Also, non-players, listed as #.AC, #.HC and so on, should not be read in.
I tried something like this
LINES <- 19
stats <- read.table(file=Datei, skip=11, header=FALSE, stringsAsFactors=FALSE,
                    encoding="UTF-8", nrows=LINES)
but as mentioned above, the value for LINES is different each time.
I also tried readLines as in this post, but had no luck with it.
Is there a way to integrate a condition in read.table, like (pseudo code)
if (first character == "AC") {
break read.table
}
Sorry if this looks strange, I don't have that much experience in scripting or coding.
Any help is appreciated, thanks a lot!
Greetz!
Your data show a couple of difficulties which should be handled in sequence, which means you should not try to read the entire file with one command:
Read plain lines and find start and stop row
Depending on the specification of the files you read in, my suggestion is to first find the first row you actually want to read by some indicator. This can be a fixed line number that is always the same or, as in my example, two lines after the line "TEAM STATS". Finding the last line is then simple: just look for the first line containing only whitespace after the start line:
lines <- readLines(Datei)
start <- which(lines == "TEAM STATS") + 2
end   <- start + min(grep("^\\s+$", lines[start:length(lines)])) - 2
# keep the full lines vector: lines[start:end] and lines[start] are used below
Read the data to data.frame
In your case you meet a couple of complications:
Your header line starts with a #, which by default is treated as a comment character, so the line is ignored. But even if you switch this behaviour off (comment.char = ""), # is not a valid column name.
If we tell read.table to split columns on whitespace, you end up with one more column in the data than in the header, since the PLAYER column contains whitespace in its cells. So for now it is best to just ignore the header line and let read.table do that with its default behaviour (comment.char = "#"). We also let the PLAYER column be split into two and fix this later.
You won't be able to use the first column as row.names since the values are not unique.
The rows have unequal length, since the POS column is not filled everywhere:
tab <- read.table( text = lines[ start:end ], fill = TRUE, stringsAsFactors=FALSE )
# fix the PLAYER column
tab$V2 <- paste( tab$V2, tab$V3 )
tab <- tab[-3]
Fix the header
Just split the start line at multiple whitespaces and replace the first entry (#) with a valid column name:
colns <- strsplit( lines[start], "\\s+" )[[1]]
colns[1] <- "code"
colnames(tab) <- colns
Fix cases where "POS" was empty
This is done by finding the rows whose last cell contains NA and shifting them by one cell to the right:
rowsToFix <- which( is.na(tab[, "SHO%"]) )   # rows where the last column is NA
tab[ rowsToFix, 4:ncol(tab) ] <- tab[ rowsToFix, 3:(ncol(tab)-1) ]
tab[ rowsToFix, 3 ] <- NA
> str(tab)
'data.frame': 25 obs. of 20 variables:
$ code : chr "93" "91" "61" "88" ...
$ PLAYER: chr "Eichelkraut, Flori" "Müller, Lars" "Alt, Sebastian" "Gross, Arthur" ...
$ POS : chr "F" "F" "D" "F" ...
$ GP : chr "8" "6" "7" "8" ...
$ G : int 10 1 4 3 4 2 0 2 1 0 ...
$ A : int 5 11 5 5 3 4 6 3 3 4 ...
$ PTS : int 15 12 9 8 7 6 6 5 4 4 ...
$ PIM : int 12 10 12 6 2 36 37 29 6 0 ...
$ PPG : int 3 0 1 1 1 1 0 0 1 0 ...
$ PPA : int 1 5 2 2 1 2 4 2 1 1 ...
$ SHG : int 0 1 0 1 1 0 0 0 0 0 ...
$ SHA : int 0 0 1 0 1 0 0 1 0 0 ...
$ GWG : int 2 0 1 0 0 0 0 0 0 0 ...
$ FG : int 1 0 1 1 1 0 0 0 0 0 ...
$ OTG : int 0 0 0 0 0 0 0 0 0 0 ...
$ UAG : int 1 0 1 0 0 0 0 0 0 0 ...
$ ENG : int 0 0 0 0 0 0 0 0 0 0 ...
$ SHOG : int 0 0 0 0 0 0 0 0 0 0 ...
$ SHOA : num 0 0 0 0 0 0 0 0 0 0 ...
$ SHO% : num 0 0 0 0 0 0 0 0 0 0 ...
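One follow-up: because of the fill/shift handling, some of the shifted columns come through as character (see code and GP in the str() output above). A hedged one-liner to convert the stat columns back to numeric, with the column positions assumed from that output:
tab[, 4:ncol(tab)] <- lapply(tab[, 4:ncol(tab)], as.numeric)   # GP onwards back to numeric
Any value that cannot be parsed would become NA with a warning, which is worth checking.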

How do I automatically specify the correct regression model when the number of variables differs between input data sets?

I have a working R program that will be used by my internal client for analysing their nutrient intake data. For each dataset that they have, they will re-run the R program.
A key part of the analysis is a nonlinear mixed-model fit, using nlmer from the lme4 package, that incorporates dummy variables for age. Depending on whether they will be analysing children or adults, the number of age band dummies in the formula will differ, although the reference age band dummy will always be the youngest. I think that the number of possible age bands ranges from 4 to about 6, so it's not a large range. It is a trivial matter to count the number of age band dummies, if I need to condition based on that.
What is the most efficient way for me to wrap the model-based code (the lmer that provides the starting parameter values, the function for the nlmer model, and the model specification in nlmer itself) so that the correct function and models are applied based on the number of age band dummies in the model? The other variables in the model are constant across datasets.
I've already got the program set up for automatically generating the relevant dummies and dropping those that aren't used in the current analysis. The program after the model is pretty well set up as automated as well. I'm just stuck on what to do with automating the two lme4-based analyses and function. These will only be run once for each dataset.
I've been wondering whether I need to write a function to contain all the lme4 related code, or whether there was an easier way. I would appreciate some pointers on how to do this. It took me one day to work out how to get the function working that I needed for the nlmer model, so I am still at a beginner level with functions.
I've searched for other R related automation questions on the site and I didn't find anything similar to what I would like to do.
Thanks in advance.
Update in response to suggestion in the comments about using a string. That sounds like an easy way forward for me, except that I don't then know how to apply the string content in a function as each dummy variable level (excluding the reference category) is used in the function for nlmer. How can I pull apart the string and use only the dummy variables that I have in a function? For example, one analysis might have AgeBand2, AgeBand3, AgeBand4, and another analysis might have AgeBand5 as well as those 3? If this was VBA, I would just create subfunctions based on the number of age dummy variables. I have no idea how to do this efficiently in R.
Can I just wrap a while loop around the lmer, function, and nlmer parts, so I have a series of while loops?
This is the section of code I wish to automate, the number of AgeBand dummy variables differs depending on the dataset that will be analysed (children vs. adults). This is using the dataset that I have been testing a SAS to R translation on, but the real datasets will be very similar. It is necessary to have a nonlinear model as this is the basis of the peer-reviewed published method that I am working off.
library(lme4)
Male.lmer <- lmer(BoxCoxXY ~ AgeBand4 + AgeBand5 + AgeBand6 + AgeBand7 +
                    AgeBand8 + Race1 + Race3 + Weekend + IntakeDay + (1|RespondentID),
                  data = Male.AddSugar,
                  weights = Replicates)
Male.lmer.fixef <- fixef(Male.lmer)
Male.lmer.fixef <- as.data.frame(Male.lmer.fixef)
bA <- Male.lmer.fixef[1,1]
bB <- Male.lmer.fixef[2,1]
bC <- Male.lmer.fixef[3,1]
bD <- Male.lmer.fixef[4,1]
bE <- Male.lmer.fixef[5,1]
bF <- Male.lmer.fixef[6,1]
bG <- Male.lmer.fixef[7,1]
bH <- Male.lmer.fixef[8,1]
bI <- Male.lmer.fixef[9,1]
bJ <- Male.lmer.fixef[10,1]
MD <- deriv(expression(b0 + b1*AgeBand4 + b2*AgeBand5 + b3*AgeBand6 +
                         b4*AgeBand7 + b5*AgeBand8 + b6*Race1 + b7*Race3 + b8*Weekend + b9*IntakeDay),
            namevec = c("b0","b1","b2","b3","b4","b5","b6","b7","b8","b9"),
            function.arg = c("b0","b1","b2","b3","b4","b5","b6","b7","b8","b9",
                             "AgeBand4","AgeBand5","AgeBand6","AgeBand7","AgeBand8",
                             "Race1","Race3","Weekend","IntakeDay"))
Male.nlmer <- nlmer(BoxCoxXY ~ MD(b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,AgeBand4,AgeBand5,AgeBand6,AgeBand7,AgeBand8,
                                  Race1,Race3,Weekend,IntakeDay)
                      ~ b0 | RespondentID,
                    data = Male.AddSugar,
                    start = c(b0=bA, b1=bB, b2=bC, b3=bD, b4=bE, b5=bF, b6=bG, b7=bH, b8=bI, b9=bJ),
                    weights = Replicates)
These will be the required changes between the datasets:
the number of fixed effect coefficients that I need to assign out of the lmer will change.
in the function, the expression, namevec and function.arg parts will change
in nlmer, the model statement and the start parameter list will change.
I can change the lmer model statement so it takes AgeBand as a factor with levels, but I still need to pull out the values of the coefficients afterwards.
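On that last point, a minimal sketch of pulling the coefficients out by name rather than by position, assuming a (hypothetical) single factor column AgeBand has been created from the dummies:
m  <- lmer(BoxCoxXY ~ AgeBand + Race1 + Race3 + Weekend + IntakeDay +
             (1 | RespondentID),
           data = Male.AddSugar, weights = Replicates)
fe <- fixef(m)
fe[grep("^AgeBand", names(fe))]   # just the age-band coefficients, however many there are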
str(Male.AddSugar) gives:
'data.frame': 10287 obs. of 23 variables:
$ RespondentID: int 9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
$ RACE : int 2 3 2 2 3 2 2 2 2 1 ...
$ RNDW : int 26290 7237 10067 75391 1133 31298 20718 23908 7905 1091 ...
$ Replicates : num 41067 2322 17434 21723 375 ...
$ DRXTNUMF : int 27 11 13 18 17 13 13 19 11 11 ...
$ DRDDAYCD : int 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeAmt : num 33.45 2.53 9.58 43.34 55.66 ...
$ RIAGENDR : int 1 1 1 1 1 1 1 1 1 1 ...
$ RIDAGEYR : int 39 23 16 44 13 36 16 60 13 16 ...
$ Subgroup : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
$ WKEND : int 1 1 1 0 1 0 0 1 1 1 ...
$ AmtInd : num 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeDay : num 0 0 0 0 0 0 0 0 0 0 ...
$ Weekend : int 1 1 1 0 1 0 0 1 1 1 ...
$ Race1 : num 0 0 0 0 0 0 0 0 0 1 ...
$ Race3 : num 0 1 0 0 1 0 0 0 0 0 ...
$ AgeBand4 : num 0 0 1 0 0 0 1 0 0 1 ...
$ AgeBand5 : num 0 1 0 0 0 0 0 0 0 0 ...
$ AgeBand6 : num 1 0 0 1 0 1 0 0 0 0 ...
$ AgeBand7 : num 0 0 0 0 0 0 0 1 0 0 ...
$ AgeBand8 : num 0 0 0 0 0 0 0 0 0 0 ...
$ YN : num 1 1 1 1 1 1 1 1 1 1 ...
$ BoxCoxXY : num 7.68 1.13 3.67 8.79 9.98 ...
The AgeBand data is incorrectly shown as the ordered factor Subgroup. Because I haven't used it, I haven't gone back and corrected this to a plain factor.
This assumes that you have one variable, "ageband", which is a factor with levels AgeBand2, AgeBand3, AgeBand4, and perhaps others that you want to be ignored. Since factors are generally treated by R regression functions using the lowest lexicographic value as the reference level, you would get your correct reference level chosen automagically. You pick your desired levels by creating a dataset that has only the desired levels.
agelevs <- c("AgeBand2", "AgeBand3", "AgeBand4")
dsub <- subset(inpdat, ageband %in% agelevs)
res <- your_fun(dsub)   # where your_fun() wraps nlmer(y ~ ageband + <other-parameters>, data = dsub, ...)
If you have gone to the trouble of creating separate variables, then you need to learn to use factors correctly rather than holding to inefficient habits enforced by training in SPSS or other clunky macro processors.
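To the automation question itself, here is a hedged sketch that builds the lmer call, the deriv() model function and the nlmer call from whichever AgeBand dummies happen to be present in the data. The wrapper name fit_models is made up; the other column names come from the question; this is a sketch under those assumptions, not tested against the real data.
library(lme4)

fit_models <- function(dat) {
  # which AgeBand dummies does this dataset contain?
  age_dummies <- grep("^AgeBand", names(dat), value = TRUE)
  covars      <- c(age_dummies, "Race1", "Race3", "Weekend", "IntakeDay")

  # linear mixed model to get starting values
  lmer_form <- reformulate(c(covars, "(1 | RespondentID)"), response = "BoxCoxXY")
  m_lmer    <- lmer(lmer_form, data = dat, weights = Replicates)
  b_start   <- fixef(m_lmer)

  # build the nonlinear model function: b0 + b1*covar1 + b2*covar2 + ...
  bnames <- paste0("b", seq_along(b_start) - 1)
  expr   <- parse(text = paste("b0 +", paste(bnames[-1], covars, sep = "*", collapse = " + ")))[[1]]
  MD     <- deriv(expr, namevec = bnames, function.arg = c(bnames, covars))

  # assemble and fit the nlmer model
  nlmer_form <- as.formula(paste0("BoxCoxXY ~ MD(", paste(c(bnames, covars), collapse = ", "),
                                  ") ~ b0 | RespondentID"))
  nlmer(nlmer_form, data = dat,
        start = setNames(b_start, bnames), weights = Replicates)
}

Male.nlmer <- fit_models(Male.AddSugar)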
