Import CSV and plotting ECDF - r

I'm new to R and I'm having trouble using the empirical cumulative distribution function (ecdf).
I have a CSV file containing 100k values (exported from Excel), which I'm importing like so:
MyData <- read.csv(file="test.csv", header=TRUE, sep=",")
which seems to be okay, but as soon as I type
P = ecdf(MyData)
I get the error:
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
undefined columns selected
I've noticed MyData[1] outputs all my values and tried
P = ecdf(MyData[1]) but alas I get the same error.
I've searched around and the error seems to pop up in many scenarios, so I can't pin down the exact issue. Any help would be appreciated, as I'm extremely new to this.

You should use either ecdf(MyData[, 1]) or ecdf(MyData[[1]]), because ecdf expects a numeric vector as input. MyData[1] prints all the values, but it is still a data frame, not a vector.
The ecdf help file states that x, the input to ecdf, should be a numeric vector.

From my reading of the ecdf documentation, the input must be a vector, so you need to pass a single column of your data frame. You can select it by name with P <- ecdf(MyData$col1), where col1 is the name of that column, or by position with P <- ecdf(MyData[, 1]), which selects all rows of column 1.
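To make the difference between the three ways of indexing concrete, here is a minimal sketch. The data frame and its column name value are made up for illustration; they stand in for whatever read.csv returned from test.csv.

```r
# Hypothetical stand-in for the imported CSV (one numeric column)
MyData <- data.frame(value = c(3, 1, 4, 1, 5, 9, 2, 6))

class(MyData[1])     # "data.frame": single brackets keep the data frame
class(MyData[[1]])   # "numeric": double brackets extract the column vector
class(MyData[, 1])   # "numeric": same for [rows, column] indexing

# ecdf() works once it receives the numeric vector
P <- ecdf(MyData[[1]])
P(4)                 # proportion of values <= 4
plot(P)              # step plot of the empirical CDF
```

Here P is itself a function, so P(4) evaluates the empirical CDF at 4.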

Related

How to transfer multiple columns into numeric & find correlation coefficients

I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I run them, I get the warning "NAs introduced by coercion". This wouldn't be the end of the world (I also tried suppressWarnings()). The bigger problem is that my function does not seem to convert the columns into the numeric type I need for cor(): I receive the "'y' must be numeric" error when I run it. I tried passing several arguments with and without '' or "", without success.
When I run str(cor.condition.cols) I only get character strings, which makes me think that my function somehow misuses as.numeric. Any suggestions for how else I could iterate over these columns and convert them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
res2 <- sapply(res[cor.condition.cols], as.numeric) # convert column by column; as.numeric() on the whole matrix would drop its dimensions
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
or, as a shorter variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
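As a rough sketch of the whole workflow on simulated data: the data frame below is invented (three gene columns stored as character strings instead of twenty, random values), and only the column naming pattern is taken from the question.

```r
set.seed(42)

# Simulated stand-in for res: Condition is numeric, gene columns are character
res <- data.frame(
  Condition   = rnorm(30),
  Genes1_Acc4 = as.character(rnorm(30)),
  Genes2_Acc4 = as.character(rnorm(30)),
  Genes3_Acc4 = as.character(rnorm(30)),
  stringsAsFactors = FALSE
)

# Build the column NAMES (not strings like "res$Genes1_Acc4")
cor.condition.cols <- paste0("Genes", 1:3, "_Acc4")

# Convert each selected column to numeric, keeping the data frame shape
res[cor.condition.cols] <- lapply(res[cor.condition.cols], as.numeric)

# One correlation per gene column, returned as a 3 x 1 matrix
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition,
                           use = "complete.obs", method = "pearson")
cor.condition.Genes
```

The key point is that the character vector of names is used to index res, so as.numeric is applied to the actual columns rather than to the names.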

Can't get 'plotweb' in the bipartite package to work (R)

I am trying to visualise a bipartite network using the bipartite package in R. My data consists of 4 columns in a spreadsheet. The columns contain 1) plant species names, 2) bee species names, 3) site, and 4) interaction frequency. I first read the data into R from a CSV file, then convert it to a web using the helper function frame2webs. When I then try to visualise the network with plotweb() I get the error message:
Error in web[rind, cind, drop = FALSE] : incorrect number of dimensions
My code looks like this:
library(bipartite)
bee <- read.csv('TestFile.csv')
bees <- as.data.frame(bee)
BeeWeb <- frame2webs(bees, type.out = "array")
plotweb(BeeWeb)
I've also tried:
BeeWeb <- frame2webs(bees,
varnames = c("higher","lower","webID","freq"),
type.out = "array")
Please help! I am new to R and am struggling to make this work. Cheers!
Not sure what your data look like, but this happens to me when I have a single factor level in either the "higher" or "lower" column, type.out is "list", and emptylist is TRUE.
This is due to a problem in empty, a function that frame2webs only calls when type.out is "list" and emptylist is TRUE. empty finds the dimensions of your data using NROW and NCOL, which interpret a single row of input as a vertical vector. When there's only one factor level in "lower" or "higher", the input to empty is a one-row array. empty interprets this row as a column, hence the 'incorrect number of dimensions' error.
Two simple workarounds:
Set type.out to "array"
Set emptylist to FALSE
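The mechanism can be reproduced in base R without the bipartite package. This is only an illustration of how a single row loses its dimensions and then fails on two-index subsetting; it is not the package's actual code, and the matrix and its labels are invented.

```r
# Toy interaction matrix: 2 plants (rows) x 3 bees (columns)
web <- matrix(1:6, nrow = 2,
              dimnames = list(c("plantA", "plantB"),
                              c("bee1", "bee2", "bee3")))

# Taking a single row drops the dimensions: the result is a plain vector
one_row <- web[1, ]
dim(one_row)     # NULL
NROW(one_row)    # 3: the vector is treated as a column
NCOL(one_row)    # 1

# Two-index subsetting on that vector raises the error from the question
err <- tryCatch(one_row[1, 1, drop = FALSE],
                error = function(e) conditionMessage(e))
err

# Keeping the dimensions avoids the problem
kept <- web[1, , drop = FALSE]
dim(kept)        # 1 3
```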

A UPGMA cluster in R with NoData values

I have a matrix of sites. I want to build a UPGMA agglomerative cluster using R and the vegan library. My matrix has sites in which not all the variables were measured.
Here is a similar matrix of data:
Variable 1;Variable 2;Variable 3;Variable 4;Variable 5
0.5849774671338231;0.7962161133598957;0.3478909861199184;0.8027122599553912;0.5596553797833573
0.5904142034898171;0.18185393432022612;0.5503250366728479;NA;0.05657408486342197
0.2265148074206368;0.6345513807275411;0.8048128547418062;0.3303602674038131;0.8924461773052935
0.020429460126217602;0.18850489885886157;0.26412619465769416;0.8020472793070729;NA
0.006945970735023677;0.8404983401121199;0.058385134042814646;0.5750066564897788;0.737599672122899
0.9909722313946067;0.22356808747617019;0.7290078902086897;0.5621006367587756;0.3387823531518016
0.5932907022602052;0.899773235815933;0.5441346748937264;0.8045695319247985;0.6183003409599681
0.6520679140573288;0.5419713133237936;NA;0.7890033752744002;0.8561828607592286
0.31285906479192593;0.3396351688936058;0.5733594373520889;0.03867689654415574;0.1975784885854912
0.5045966366726562;0.6553489439611587;0.029929403932252963;0.42777351534900676;0.8787135401098227
I am planning to do it with the following code:
library(vegan)
# env <- read.csv("matrix_of_sites.csv")
env.norm <- decostand(env, method = "normalize") # Normalizing data here
env.ch <- vegdist(env.norm, method = "euclidean")
env.ch.UPGMA <- hclust(env.ch, method="average")
plot(env.ch.UPGMA)
After I run the second line, I get this error:
Error in x^2 : non-numeric argument to binary operator
I am not familiar with R, so I am not sure if this is due to the cells with no data. How can I solve this?
R does not think that the data in your matrix are numeric: at least some of them were interpreted as character variables and changed to factors. Inspect your data after reading it into R. If all your data are numbers, then sum(env) gives a numeric result. Use str() or summary() for a more detailed inspection.
From R's point of view, your data file has mixed formatting. read.csv assumes that items are separated by a comma (,) and that the decimal separator is a period (.), whereas read.csv2 assumes that items are separated by a semicolon (;) and that the decimal separator is a comma (,). Your file mixes these two conventions. You can read data formatted like that, but you may have to give both the sep and dec arguments.
Once your data are read correctly, decostand will stop with an error: it does not accept missing values unless you add na.rm = TRUE. The same applies to the next command, vegdist, which also needs na.rm = TRUE to analyse your data.
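A minimal sketch of the reading step, using only base R. The inline text is a shortened, made-up version of the semicolon-separated file (three columns, rounded values), so the numbers are illustrative only.

```r
# Toy data in the same layout: semicolon separators, period decimals, NA cells
txt <- "Variable 1;Variable 2;Variable 3
0.58;0.79;0.34
0.59;0.18;NA
0.22;0.63;0.80"

# Give both sep and dec explicitly, since the file mixes the two conventions
env <- read.csv(text = txt, sep = ";", dec = ".")

str(env)                  # every column should be num, not chr or Factor
sapply(env, is.numeric)   # TRUE for each column if the import worked
sum(env, na.rm = TRUE)    # a numeric result confirms the data are numbers
```

With a real file, text = txt would be replaced by the file name. If str() still shows character or factor columns, the separator or decimal settings do not match the file.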

Plotting uneven row sizes in R

I have data in tab delimited rows of uneven length and I want to make a histogram for each row:
1    23    352    4    12    94    0    2
434    13    29
5    93    93    34
(...more rows)
This is what I currently have (no fanciness included):
data <- read.delim("file.txt", header = FALSE, sep = "\t")
for (j in 1:nrow(data)) { # loop over each row
  hist(data[j, ])
}
But when I try to make the histogram, I think it tries to include the NA's in the row of the data frame, since R gives me the error message: "Error in hist.default(data[2, ]) : 'x' must be numeric".
When I try to use:
read.scan("file.txt, sep="\t")
I'm left with something I don't know how to separate by rows. Do I have a better option than splitting the file into one row per file and then reading in each row separately? (I am running into the same problem with uneven column size...)
The error results from the fact that grabbing a row from a data.frame yields an object of class data.frame (and hist() wants class numeric). Just convert it to numeric:
hist(as.numeric(data[j,]))
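Putting the fix into the loop might look like the sketch below. The inline text reuses the three rows shown in the question in place of file.txt, and dropping the NA padding explicitly is an optional extra step, not something hist() requires.

```r
# The three rows from the question, tab-delimited, standing in for file.txt
txt <- "1\t23\t352\t4\t12\t94\t0\t2\n434\t13\t29\n5\t93\t93\t34"
data <- read.delim(text = txt, header = FALSE, sep = "\t")

class(data[2, ])   # "data.frame": this is why hist() complains

# Convert each row to a numeric vector and drop the NA padding
row_values <- lapply(seq_len(nrow(data)), function(j) {
  x <- as.numeric(data[j, ])
  x[!is.na(x)]
})

# One histogram per row
for (j in seq_along(row_values)) {
  hist(row_values[[j]], main = paste("Row", j), xlab = "value")
}
```

Short rows are padded with NA when the file is read, which is why the second and third rows come back with eight columns.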

Error in R "undefined columns selected"

I am trying to initiate this code using the zoo command:
gld <- zoo(gld[,7], gld_dates)
Unfortunately I get an error message telling me this:
Error in `[.data.frame`(gld, , 7) : undefined columns selected
I want to use the zoo function to create zoo objects from my data.
The function should take two arguments: a vector of data and
a vector of dates.
This is the data I am using [LINK BROKEN].
I believe I have 7 columns in my data set. Any ideas?
The code I am trying to implement is found here [LINK BROKEN].
Is there anything wrong with this code?
You don't say what your gld_dates is exactly, but if gld starts as your original data and you want to make a zoo object of the 7th column ordering by the 1st column (dates), I can do
gld_zoo <- zoo(gld[, 7], gld[, 1])
just fine. Equivalently, but with more readability,
gld_zoo <- zoo(gld$Adj.close, gld$Date)
reminds me what each column is.
Subsetting requires the names of the selected columns to match those in the data frame. This code subsets the dataset french_fries with potatoes instead of potato:
library(reshape2) # assumption: french_fries is loaded from the reshape2 package
data("french_fries")
df_potato <- french_fries[, c("potatoes")]
and it fails with:
Error in `[.data.frame`(french_fries, , c("potatoes")) :
undefined columns selected
but using the right name potato works:
df_potato <- french_fries[, c("potato")]
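One way to catch this before subsetting is to compare the requested names with names() of the data frame. The sketch below uses a small invented data frame rather than french_fries, so it runs without extra packages.

```r
# Hypothetical data frame with a column called "potato"
df <- data.frame(potato = c(2.9, 14.0, 11.0), buttery = c(0.0, 0.0, 6.4))

wanted <- c("potatoes")

# Names that are requested but not present in the data frame
missing_cols <- setdiff(wanted, names(df))
missing_cols

# Subsetting with a wrong name reproduces the error message
err <- tryCatch(df[, wanted],
                error = function(e) conditionMessage(e))
err

# Checking ncol() and names() also reveals whether a 7th column exists
ncol(df)
names(df)
```

The same check applies to numeric indices: df[, 7] fails in the same way whenever ncol(df) is smaller than 7.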
