I have a function that returns 50 data values, in a one-column matrix, for each of 100 different data frames. However, due to circumstance, the function sometimes returns NaN for one or more of the 50 values in a data frame. This perturbs the data, as a data frame with one or more NaNs is then left with only 49 or 48 usable values.
df1 df2
112.4563 112.4563
110.1210 110.1210
109.2143 109.2143
NaN 108.1806 <- now uneven; iterations can no longer be performed
107.3700 107.3700
How can I tell my computer/subsequent commands, when iterating through these 100 data frames of 50 rows each, to "ignore" the NaN values in such a way that each of the 100 still has 50 values and is consistently iterable? Or is it even possible to have a varying iteration range, e.g. for (i in 1:n) with n anywhere from 47 to 50, so that the computer forgives the variance in row numbers?
This also applies to graphs.
As someone else has noted, it also depends on what you want to do with the NaN values. However, to answer the question about an iterable range, you can do something like the following. I'll use the data frame mtcars as an example.
df <- mtcars
length(df$mpg)       # number of values in one column = number of rows
length(rownames(df)) # number of rows
length(colnames(df)) # number of columns
If you need to iterate over the total number of rows in your data frame, you can use length(rownames(df)); if you need to iterate over the number of columns instead, you can use length(colnames(df)). (Equivalently, nrow(df) and ncol(df).)
In a for loop, you would do the following:
for (i in 1:length(rownames(df))) {
# iterative code
}
This will iterate over the total number of rows in a given data frame.
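If you also want to skip the NaN entries while looping, one option (a sketch; the column name value is hypothetical) is to test each element with is.nan():

for (i in 1:length(rownames(df))) {
  if (is.nan(df$value[i])) next  # skip NaN entries; df$value is a made-up column
  # iterative code
}

Alternatively, drop the NaN rows up front with df[!is.nan(df$value), ], which keeps every data frame the same shape for plotting.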
I have a multiple-response variable with seven possible observations: "Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker".
If a respondent chose more than one observation, however, the answers are not separated in the data (Data).
My goal is to create a matrix with all possible observations as variables, marked with 1 (yes) and 0 (no). Currently I am using this command:
einzeln_strategisch_2021 <- data.frame(strategisch_2021[, !colnames(strategisch_2021) %in% "Q12"], model.matrix(~ Q12 - 1, strategisch_2021)) %>%
This gives me the kind of matrix I want, but it does not separate the observations, so I end up with a matrix of 20 variables instead of the seven.
I also tried separate(), like this:
separate(Q12, into = c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker"), ";") %>%
This does separate the observations, but not in the right order, and without the matrix.
How do I separate my observations and create a matrix with the possible observations as variables akin to the third picture (Matrix)?
Thank you very much in advance ;)
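One possible approach (a sketch, assuming the choices in Q12 are separated by ";"; the toy data below stands in for strategisch_2021):

library(dplyr)
library(tidyr)
# toy stand-in for strategisch_2021; Q12 holds semicolon-separated answers
strategisch_2021 <- data.frame(
  id  = 1:3,
  Q12 = c("Inhalt;Arbeit", "Spitzenpolitiker", "Inhalt;Verhindern Koalition")
)
einzeln <- strategisch_2021 %>%
  separate_rows(Q12, sep = ";") %>%   # one row per chosen answer
  mutate(gewaehlt = 1L) %>%           # mark each choice with 1
  pivot_wider(names_from = Q12, values_from = gewaehlt, values_fill = 0L)

This yields one 0/1 column per observed answer, in the order the answers appear in the data.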
I have a simple question. The aggregate() function in R operates on a data frame based on the conditions specified.
aggregate(my.data.frame, list(desired column), function to be applied) is the default usage.
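For example, with the built-in mtcars data (mean mpg per number of cylinders):

aggregate(mtcars$mpg, list(cyl = mtcars$cyl), mean)
#   cyl        x
# 1   4 26.66364
# 2   6 19.74286
# 3   8 15.10000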
It is useful for computing simple functions, like the mean or median, over specific column values of a data frame. What I have, though, is a function that doesn't operate on data frames, and I need to aggregate my data frame after applying this function to a specific column. Let me show the dataset:
GPS Dataset
So I need to compute the centroid of the longitude and latitude points for EACH BSSID; I need to aggregate it that way. The functions I found online in various packages compute the centroid for a matrix of values, not a data frame, whereas aggregate() doesn't work on non-data-frames.
Many thanks in advance :)
Aggregate works fine on matrices (and not just data frames).
Here's a reproducible example of your problem, using a matrix instead of a data frame:
my_matrix <- matrix(c(100,100,200,200,11,22,33,44,-1,-2,-3,-4),
nrow=4,ncol=3,
dimnames=list(c(1,2,3,4),c('BSSID','lat','long')))
> my_matrix
BSSID lat long
1 100 11 -1
2 100 22 -2
3 200 33 -3
4 200 44 -4
> aggregate(cbind(lat,long)~BSSID,my_matrix,mean)
BSSID lat long
1 100 16.5 -1.5
2 200 38.5 -3.5
So that is the mean (i.e., the centroid) of the latitudes and longitudes for each BSSID. The cbind() function (column bind) lets you select multiple variables to aggregate, similar to an Excel pivot table.
If still in doubt, you can always convert a matrix to a data frame with as.data.frame(), and back to a matrix with as.matrix() if needed.
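For example (my_df is a placeholder name):

mat    <- as.matrix(my_df)       # data frame -> matrix
my_df2 <- as.data.frame(mat)     # matrix -> data frame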
I like dplyr for this - the syntax looks nice to me.
library(dplyr)

my.data.frame %>%
  group_by(bssid) %>%
  summarise(centroidlon = myfunction(lon, lat)[1],
            centroidlat = myfunction(lon, lat)[2])
If myfunction is fast, then this will work, but if it is slow, you probably want to rework it so that you only call the function once per bssid.
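One way to call it only once per group (a sketch; myfunction and the column names are taken from the question) is to store each result in a list column and then unpack it:

library(dplyr)
centroids <- my.data.frame %>%
  group_by(bssid) %>%
  summarise(centroid = list(myfunction(lon, lat))) %>%  # one call per bssid
  mutate(centroidlon = sapply(centroid, `[`, 1),        # first element of each result
         centroidlat = sapply(centroid, `[`, 2)) %>%    # second element
  select(-centroid)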
Edit: to show an alternative method without the %>% operator:
grouped.data.frame = group_by(my.data.frame, bssid)
summarised.data.frame = summarise(grouped.data.frame,
centroidlon = myfunction(lon, lat)[1],
centroidlat = myfunction(lon, lat)[2])
The %>% operator takes the left hand side, and passes it as the first argument to the right hand side. It's useful for chaining your statements together without getting confused by hundreds of nested brackets. It makes things easier to read, in my opinion.
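For example, these two lines do the same thing (n() simply counts the rows in each group):

summarise(group_by(my.data.frame, bssid), n = n())
my.data.frame %>% group_by(bssid) %>% summarise(n = n())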
I want to create a random subset of a data.table df that is very large (around 2 million rows).
The data table has a weight column, wgt, that indicates how many observations each row represents.
To generate the vector of row numbers I want to extract, I proceed as follows:
I get the exact number of observations:
ns <- length(df$wgt)
I get the number of desired rows (30% of the sample):
lines <- round(0.3 * ns)
I compute the vector of probabilities:
pr <- df$wgt / sum(df$wgt)
And then I compute the vector of row numbers for the subsample:
ssout <- sample(1:ns, size = lines, prob = pr)
The final aim is to subset the data using df[ssout,]. However, R gets stuck when computing ssout.
Is there a faster/more efficient way to do this?
Thank you!
I'm guessing that df is a summary description of a data set that has repeated observations (with wgt being the count of repetitions). In that case, the only useful way to sample from it would be with replacement; and a proper 30% sample would be 30% of the real population, .3*sum(wgt):
# example data
wgt <- sample(10,2e6,replace=TRUE)
nobs<- sum(wgt)
pr <- wgt/sum(wgt)
# select rows
system.time(x <- sample.int(2e6,size=.3*nobs,prob=pr,replace=TRUE))
# user system elapsed
# 0.20 0.02 0.22
Sampling rows without replacement takes forever on my computer, but is also something that I don't think one needs to do here.
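For completeness, to materialize the subsample from the sampled indices (a sketch; df below is a made-up stand-in for the asker's table, reusing wgt and x from above):

library(data.table)
df <- data.table(id = 1:2e6, wgt = wgt)  # toy table
df_sample <- df[x]  # data.table subsets rows with df[x]; duplicate rows are expected with replace=TRUE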
I have two data frames. The first contains 165 columns (species names) and almost 193,000 rows; each cell holds a number from 0 to 1, the probability that the species is present at that point.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
2 0.0279037 0.604687 0.0388309 0.0161980 0.0143966 0.240152
3 0.0294101 0.674846 0.0673055 0.0481405 0.0397423 0.231308
4 0.0292839 0.603869 0.0597947 0.0526606 0.0463431 0.188875
6 0.0331264 0.541165 0.0470451 0.0270871 0.0373348 0.256662
8 0.0393825 0.672371 0.0715808 0.0559353 0.0565391 0.230833
9 0.0376557 0.663732 0.0747417 0.0445794 0.0602539 0.229265
The second data frame contains 164 columns (species names, as in the first data frame) and one row, which is the threshold: above it we assume the species is present, below it the species is absent.
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
0.3155 0.2816 0.2579 0.2074 0.3007 0.3513 0.3514
What I want to do is make a new data frame that contains, for every species in the presence-possibility data (my.data), the possibility value if it is above the threshold (thres), and zero if it is below the threshold.
I know this calls for a for loop and an if statement, but I am new to R and I don't know how to do this.
Please help me.
I think you want something like this:
(First, make up a small reproducible example:)
set.seed(101)
speciesdat <- data.frame(pointID=1:10,matrix(runif(100),ncol=10,
dimnames=list(NULL,LETTERS[1:10])))
threshdat <- rbind(seq(0.1,1,by=0.1))
Now process:
thresh <- unlist(threshdat) ## make data frame into a vector
## 'sweep' runs the function column-by-column if MARGIN=2
ss2 <- sweep(as.matrix(speciesdat[,-1]),MARGIN=2,STATS=thresh,
FUN=function(x,y) ifelse(x<y,0,x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID,ss2)
It's simpler to have the same number of columns (with the same meanings, of course), so first add a matching POINTID column to frame2:
frame2 <- data.frame(POINTID=0, frame2)
R works with vectors, so a row of frame1 can be directly compared to frame2:
frame1[1,] < frame2
You could use an explicit loop over every row of frame1, but it's more common to use the implicit loop of apply():
answer <- apply(frame1, 1, function(x) x < frame2)
This was all a rather sloppy solution (especially changing frame2), but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices when possible (they can still use labels, but are generally faster).
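For reference, a minimal runnable version of the above, with made-up frames:

frame1 <- data.frame(A = c(0.2, 0.5), B = c(0.4, 0.1))  # toy presence possibilities
frame2 <- data.frame(A = 0.3, B = 0.3)                  # toy thresholds, one row
below <- t(apply(frame1, 1, function(x) x < unlist(frame2)))  # TRUE where under threshold
frame1[below] <- 0   # zero out the sub-threshold values
frame1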
This produces a logical matrix which can be used to generate assignments with "[<-". (Assuming the name of the multi-row data frame is "cols" and the named vector is "vec":)
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch between the number of columns and the length of the vector, but presumably you can adjust the length of the vector to the correct number of entries.
I am trying to perform a multivariate test for normality on some density data from five sites, using mshapiro.test from the mvnormtest package. Each site is a column, and densities are below. It is 5 columns and 5 rows, with the top row as the header (site names). Here is how I loaded my data:
datafilename="/Users/megsiesiple/Documents/Lisa/lisadensities.csv"
data.nc5=read.csv(datafilename,header=T)
attach(data.nc5)
The data look like this:
B07 B08 B09 B10 M
1 72571.43 17714.29 3142.86 22571.43 8000.00
2 44571.43 46857.14 49142.86 16857.14 7142.86
3 54571.43 44000.00 26571.43 6571.43 17714.29
4 57714.29 38857.14 32571.43 2000.00 5428.57
When I call mshapiro.test() on data.nc5 I get this message: Error in mshapiro.test(data.nc5) :
U[] is not a matrix with number of columns (sample size) between 3 and 5000
I know that to perform a Shapiro-Wilk test using mshapiro.test(), the data have to be in a numeric matrix with between 3 and 5000 columns. However, even when I make the .csv a matrix of only numbers (i.e., when I omit the site names), I still get the error. Do I need to set up the matrix differently? Has anyone else had this problem?
Thanks!
You need to transpose the data into a matrix, so that your variables are in rows and your observations are in columns. The command is:
M <- t(data.nc5[1:4,1:5])
mshapiro.test(M)
It works for me this way. The labels in the first row should be recognized during the import, so the data will start from row 1. Otherwise, there will be a "missing value" error.
If you read the numeric matrix into R via read.csv() using code similar to what you show, it will be read in as a data frame, and a data frame is not a matrix.
Try
mat <- data.matrix(data.nc5)
mshapiro.test(mat)
(Not tested as you don't give a reproducible example and it is late-ish in my time zone now ;-)