Density for multiple columns at a time in R

I have an input file like this
V1 V2 V3 V4.............V60
11 22 33 44.............89
21 98 22 33.............09
33 44 55 78.............20
The above file has more than 3000 rows with 60 columns in each row.
When I try using density(data, kernel="gaussian", bw=15) at my R prompt, it generates an error saying
Error in density.default(data) : argument 'x' must be numeric
But, when I try density(data$V1, kernel="gaussian", bw=15), it works fine.
I was wondering if there is a single command to calculate the density for the entire file instead of doing it for every single column 60 times.

You might be looking for sapply or apply.
You can use
apply(myDataName, 2, density, kernel="gaussian", bw=15)
If your columns are factors instead of numeric, you will need to convert those first.
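A minimal sketch of that approach, assuming a small made-up data frame `dat` in place of the real 60-column file (the name and the values are invented for illustration). It converts a factor column to numeric first and then runs density once per column:

```r
# Made-up stand-in for the real file: 3 columns instead of 60
set.seed(1)
dat <- data.frame(V1 = rnorm(100, 50, 20),
                  V2 = rnorm(100, 40, 10),
                  V3 = factor(sample(c("11", "22", "33"), 100, replace = TRUE)))

# Convert factor columns to numeric via character, so the labels
# (not the internal integer codes) become the values
is_fac <- sapply(dat, is.factor)
dat[is_fac] <- lapply(dat[is_fac], function(col) as.numeric(as.character(col)))

# One density estimate per column (MARGIN = 2 means "by column")
dens <- apply(dat, 2, density, kernel = "gaussian", bw = 15)

names(dens)      # one entry per column
class(dens$V1)   # "density"
```

Going through as.character matters here: calling as.numeric directly on a factor returns the level codes rather than the original numbers.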

Most likely your data object is a data frame (this is the default when reading data using tools like read.table and read.csv).
If you want to process each column (create a separate density plot for each column) then you can use the lapply function.
If you want one single density based on all the data (the columns don't mean anything), then you can use the unlist function to convert it all to one big vector. It may be better to use the scan function instead of read.table to load the data into a vector to begin with and skip the data frame altogether.
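Both options could look roughly like this. The data frame `dat` is made up for illustration, and the file name in the commented scan call is hypothetical:

```r
# Made-up stand-in for the data frame read from the file
set.seed(1)
dat <- data.frame(V1 = rnorm(100, 50, 20),
                  V2 = rnorm(100, 40, 10))

# Separate density per column: lapply treats the data frame as a list of columns
per_column <- lapply(dat, density, kernel = "gaussian", bw = 15)

# One pooled density: flatten all columns into a single numeric vector first
pooled <- density(unlist(dat), kernel = "gaussian", bw = 15)

# Reading straight into a vector instead (file name is hypothetical):
# all_values <- scan("input.txt", skip = 1)  # skip = 1 drops the header line
```

The result of lapply is a named list, so per_column$V1 can be passed to plot() like any other density object.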

Related

Rowsums isn't adding correctly?

I have a presence/absence database with a bunch of zeroes and ones, but when I use rowSums, it seems to only count a portion of the data and then stop. Here's my code
site_matrix=read.csv("TriassicMatrix1.csv", header=T) # create object called site_matrix
summary(site_matrix) # get summary
head(site_matrix) # check out first few rows
tail(site_matrix) # check out last few rows
View(site_matrix) # take a look at whole dataset in new window
# here's the problematic line
spp_rich=rowSums(site_matrix [,2:25]) # generate richness for sites
There are 25 rows of data, and it gives me incorrect output, such as suggesting the first row only has 4 occurrences when it has 7.
I tried changing it to [,1,25] and it won't work since row 1 is my title row, so I know it's not that. When I view the data within R I can very easily go to row 2 and count out the data, since there are only a few hundred columns.
It appears to be 'cutting off' at about the halfway point, column-wise.

Counting NA values by ID?

I'm learning R from scratch right now and am trying to count the number of NAs within a given table, aggregated by the ID of the file it came from. I then want to output that information in a new data frame, showing just the ID and the sum of the NA lines contained within. I've looked at some similar questions, but they all seem to deal with very short datasets, whereas mine is comparatively long (10k+ lines) so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...
Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# get file names and initialize a vector for the counts
fileNames <- list.files(<filePath>)
missRowsVec <- integer(length(fileNames))
# loop through files, get number of rows with missing values
for(filePos in seq_along(fileNames)) {
  # read in files **fill in <filePath>**
  temp <- read.csv(paste0(<filePath>, fileNames[filePos]), as.is=TRUE)
  # count the number of rows with missing values,
  # ** fill in <fieldName#> with strings of variable names **
  # the second argument (MARGIN = 1) applies the function to each row
  missRowsVec[filePos] <- sum(apply(temp[, c(<field1Name>, <field2Name>)], 1,
                                    function(i) anyNA(i)))
} # end loop
# build data frame
myDataFrame <- data.frame("fileNames"=fileNames, "missCount"=missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
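If everything is already in one table with an ID column, as the question describes, a sketch along these lines may be enough. The data frame below is made up, and "NA lines" is read here as rows containing at least one NA outside the ID column:

```r
# Made-up table with an ID column and three other columns
Data <- data.frame(ID = c(1, 1, 1, 2, 2, 3),
                   A  = c(NA, 2, 3, NA, 5, 6),
                   B  = c(1, NA, 3, NA, 5, 6),
                   C  = c(1, 2, 3, 4, NA, 6))

# TRUE for every row that has at least one NA outside the ID column
row_has_na <- rowSums(is.na(Data[, names(Data) != "ID"])) > 0

# Sum the TRUE values within each ID
na_counts <- aggregate(data.frame(NA_Count = row_has_na),
                       by = list(ID = Data$ID), FUN = sum)
na_counts
```

Because is.na and rowSums are vectorised, this does not need to touch each line individually, so the length of the table should not matter much.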

Aggregating functions which operate on non-data frame objects in R

I have a simple question. The aggregate() function in R operates on a dataframe based on the conditions specified.
aggregate(my.data.frame, list(desired column), function to be applied) is the default usage.
It is useful to compute simple functions like mean and median of a dataframe's column specific values. What I have, though, is a function which doesn't operate on dataframes, but I need to aggregate my dataframe after performing this function on a specific column. Let me show the dataset:
GPS Dataset
So I need to compute the centroid for the longitude and latitude points for EACH BSSID, I need to aggregate it that way. The functions I found online from various packages compute the centroid for a matrix of values and not a dataframe, whereas aggregate() doesn't work on non-dataframes.
Many thanks in advance :)
Aggregate works fine on matrices (and not just data frames).
Here's a reproducible example of your problem, using a matrix instead of a data frame:
my_matrix <- matrix(c(100,100,200,200,11,22,33,44,-1,-2,-3,-4),
                    nrow=4,ncol=3,
                    dimnames=list(c(1,2,3,4),c('BSSID','lat','long')))
> my_matrix
BSSID lat long
1 100 11 -1
2 100 22 -2
3 200 33 -3
4 200 44 -4
> aggregate(cbind(lat,long)~BSSID,my_matrix,mean)
BSSID lat long
1 100 16.5 -1.5
2 200 38.5 -3.5
So that would be the mean (or the centroid) of the latitudes and longitudes for each BSSID. The cbind function (column-bind) lets you select multiple variables to be aggregated, similar to an Excel Pivot Table.
If still in doubt, you can always convert matrices to data-frames by using the as.data.frame() function and revert back to matrices using as.matrix() if needed.
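A sketch of that conversion route, with values chosen to match the printed matrix above. If the formula interface complains about the matrix (some R versions appear to insist on a data frame there), converting first is a safe option:

```r
# Values chosen to match the printed matrix
my_matrix <- matrix(c(100, 100, 200, 200, 11, 22, 33, 44, -1, -2, -3, -4),
                    nrow = 4, ncol = 3,
                    dimnames = list(NULL, c("BSSID", "lat", "long")))

# Convert to a data frame before using the formula interface
my_df <- as.data.frame(my_matrix)

# Mean latitude and longitude for each BSSID
centroids <- aggregate(cbind(lat, long) ~ BSSID, data = my_df, FUN = mean)
centroids
```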
I like dplyr for this - the syntax looks nice to me.
library(dplyr)  # provides %>%, group_by and summarise
my.data.frame %>%
  group_by(bssid) %>%
  summarise(centroidlon = myfunction(lon, lat)[1],
            centroidlat = myfunction(lon, lat)[2])
If myfunction is fast, then this will work, but if it is slow, you probably want to rework it so that you only call the function once per bssid.
Edit to show alternative method without %>% operator
grouped.data.frame = group_by(my.data.frame, bssid)
summarised.data.frame = summarise(grouped.data.frame,
centroidlon = myfunction(lon, lat)[1],
centroidlat = myfunction(lon, lat)[2])
The %>% operator takes the left hand side, and passes it as the first argument to the right hand side. It's useful for chaining your statements together without getting confused by hundreds of nested brackets. It makes things easier to read, in my opinion.
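For the case where myfunction is slow, here is one way to call it only once per bssid. This sketch uses base R (split and lapply) rather than dplyr, and both the data and the body of myfunction are made up for illustration:

```r
# Made-up data and a made-up stand-in for myfunction
my.data.frame <- data.frame(bssid = c("a", "a", "b", "b"),
                            lon   = c(-1, -2, -3, -4),
                            lat   = c(11, 22, 33, 44))
myfunction <- function(lon, lat) c(mean(lon), mean(lat))  # hypothetical centroid

# Split into one data frame per bssid, then call myfunction once on each piece
pieces <- split(my.data.frame, my.data.frame$bssid)
result <- do.call(rbind, lapply(pieces, function(p) {
  centre <- myfunction(p$lon, p$lat)
  data.frame(bssid = p$bssid[1], centroidlon = centre[1], centroidlat = centre[2])
}))
result
```

Each piece is processed with a single call, and both coordinates are taken from the same returned vector.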

omitting certain data in R to maintain overall data integrity

I have a function that returns 50 data values, in a one-column matrix, for each of 100 different data frames. However, in some circumstances the function returns "NaN" for one or more of the 50 values in a data frame. This disturbs the data, because a data frame that has one or more NaN values is now treated as having only 49 or 48 values.
df1 df2
112.4563 112.4563
110.1210 110.1210
109.2143 109.2143
NaN 108.1806 <- now uneven and can not perform iterations
107.3700 107.3700
How can I tell my computer/subsequent commands, when iterating through these 100 data frames of 50 rows each, to "ignore" the NaN values in a way that each of the 100 will still have 50 values and be consistently iterable? Or is it even possible to have a varying iteration range, such as for(i in 1:(47-50), so that the computer forgives the variance in row numbers?
This also applies to graphs.
As someone else has noted, it can also depend on what you want to do with the NaN value. However, to answer the question about an iterative range, you can do something like the following. I'll be using the data frame mtcars as an example.
df = mtcars
length(df$mpg)
length(rownames(df))
length(colnames(df))
If you need to iterate over the total number of rows in your data frame, you can use length(rownames(df)). If you need to iterate over the number of columns instead, you can use length(colnames(df)).
In a for loop, you would do the following:
for (i in seq_len(length(rownames(df)))) {
  # iterative code
}
This will iterate over the total number of rows in a given data frame.
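To tie this back to the NaN question, here is a small sketch with made-up values modelled on the df1 column in the question. The loop always runs over every row and simply skips the NaN entries, so the iteration range stays the same for every data frame:

```r
# Made-up column with one NaN, standing in for one of the 50-value results
df <- data.frame(value = c(112.4563, 110.1210, 109.2143, NaN, 107.3700))

total <- 0
for (i in seq_len(nrow(df))) {
  if (is.nan(df$value[i])) next   # skip the NaN row, the row count is unchanged
  total <- total + df$value[i]
}
total

# Vectorised equivalent: drop missing values inside the summary function
sum(df$value, na.rm = TRUE)
```

Many summary functions (sum, mean, max and so on) accept na.rm = TRUE, which also drops NaN, so an explicit loop is often not needed.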

Error when using mshapiro.test: "U[] is not a matrix with number of columns (sample size) between 3 and 5000"

I am trying to perform a multivariate test for normality on some density data from five sites, using mshapiro.test from the mvnormtest package. Each site is a column, and densities are below. It is 5 columns and 5 rows, with the top row as the header (site names). Here is how I loaded my data:
datafilename="/Users/megsiesiple/Documents/Lisa/lisadensities.csv"
data.nc5=read.csv(datafilename,header=T)
attach(data.nc5)
The data look like this:
B07 B08 B09 B10 M
1 72571.43 17714.29 3142.86 22571.43 8000.00
2 44571.43 46857.14 49142.86 16857.14 7142.86
3 54571.43 44000.00 26571.43 6571.43 17714.29
4 57714.29 38857.14 32571.43 2000.00 5428.57
When I call mshapiro.test() for data.nc5 I get this message: Error in mshapiro.test(data.nc5) :
U[] is not a matrix with number of columns (sample size) between 3 and 5000
I know that to perform a Shapiro-Wilk test using mshapiro.test(), the data has to be in a numeric matrix, with a number of columns between 3 and 5000. However, even when I make the .csv a matrix with only numbers (i.e., when I omit the Site names), I still get the error. Do I need to set up the matrix differently? Has anyone else had this problem?
Thanks!
You need to transpose the data into a matrix, so that your variables are in rows and your observations in columns. The commands are:
M <- t(data.nc5[1:4,1:5])
mshapiro.test(M)
It works for me this way. The labels in the first row should be recognized during the import, so the data will start from row 1. Otherwise, there will be a "missing value" error.
If you read the numeric matrix into R via read.csv() using code similar to what you show, it will be read in as a data frame, and that is not a matrix.
Try
mat <- data.matrix(data.nc5)
mshapiro.test(mat)
(Not tested as you don't give a reproducible example and it is late-ish in my time zone now ;-)
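The two suggestions can be combined. The sketch below rebuilds the data frame from the values printed in the question and only checks the type and shape of the result. The mshapiro.test call itself is left commented out because it needs the mvnormtest package and is not run here:

```r
# Values copied from the question; in practice this comes from read.csv()
data.nc5 <- data.frame(B07 = c(72571.43, 44571.43, 54571.43, 57714.29),
                       B08 = c(17714.29, 46857.14, 44000.00, 38857.14),
                       B09 = c(3142.86, 49142.86, 26571.43, 32571.43),
                       B10 = c(22571.43, 16857.14, 6571.43, 2000.00),
                       M   = c(8000.00, 7142.86, 17714.29, 5428.57))

is.matrix(data.nc5)   # FALSE: this is a data frame, not a matrix

# Numeric matrix, transposed so that sites are rows and observations are columns
M <- t(data.matrix(data.nc5))
is.matrix(M)          # TRUE
dim(M)                # 5 rows (sites), 4 columns (observations)

# With the mvnormtest package installed, the call would then be:
# mvnormtest::mshapiro.test(M)
```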
