Basic frequency value, split by group

I'm trying to get a length(x) value for a variable split by another variable.
aggregate(ssfia$Correlation_Abs,ssfia$Disorder,length,na.rm=TRUE)
However I get an error message:
>Error in FUN(X[[1L]], ...) :
2 arguments passed to 'length' which requires 1
It appears that length(x) can't be used there like "mean" or "sd" can. Is there a function that can count rows nested in an aggregate?
Thanks in advance!

What about the following?
Split the first vector into subclasses determined by the second one:
cls <- split(ssfia$Correlation_Abs, ssfia$Disorder)
Then count how many non-NA observations fall into each subclass:
sapply(cls, function(dat) sum(!is.na(dat)))
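For context, the original error comes from na.rm=TRUE being forwarded by aggregate to length, which accepts only one argument. Below is a minimal sketch of the split/sapply approach on made-up data; the toy ssfia data frame and its values are assumptions for illustration only. It also shows tapply as an equivalent one-step alternative.

```r
# Toy stand-in for the asker's ssfia data frame (values are invented).
ssfia <- data.frame(
  Disorder = c("A", "A", "B", "B", "B"),
  Correlation_Abs = c(0.2, NA, 0.5, 0.1, NA)
)

# Split the values by group, then count the non-NA entries per group.
cls <- split(ssfia$Correlation_Abs, ssfia$Disorder)
counts <- sapply(cls, function(dat) sum(!is.na(dat)))

# Same count in one step with tapply.
counts_tapply <- tapply(ssfia$Correlation_Abs, ssfia$Disorder,
                        function(dat) sum(!is.na(dat)))

print(counts)
```

Wrapping the NA handling inside the anonymous function avoids passing na.rm to a function that does not understand it.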

Related

Trying to write a loop that makes a calculation for every vector in a dataframe

I am trying to write a loop that will perform a calculation on every value of every vector in a dataframe. Essentially, I am trying to standardize the values in the dataframe. I am trying to find the mean of each vector. Then I subtract that mean from the individual data values in each vector. Then I want to divide the difference (data value minus mean of vector) by the standard deviation of the vector.
The expected result is that the mean is 0 and the standard deviation 1 for every individual vector in the dataframe.
I tried using this code:
for(i in colnames(metabolites)) {
metabolites<-metabolites %>%
(i-(mean(i)))/sd(i)
}
But it returns this error:
> for(i in colnames(metabolites)) {
+ metabolites<-metabolites %>%
+ (i-(mean(i)))/sd(i)
+ }
Error in i - (mean(i)) : non-numeric argument to binary operator
In addition: Warning message:
In mean.default(i) : argument is not numeric or logical: returning NA
I tried writing the loop a couple of different ways. I expected it to produce a standardized dataset where every vector has its own mean of 0 and a standard deviation of 1.

The issue is that in the for-loop, i is each column name, a character string. So you are passing the name of the column to mean, not the column values. Hence the error "non-numeric argument".
Column values are accessed using metabolites[, i] so something like this should work:
for(i in colnames(metabolites)) {
  metabolites[, i] <- (metabolites[, i] - mean(metabolites[, i])) / sd(metabolites[, i])
}
You may also want to look at the scale function, or dplyr::mutate as a way to alter column values.
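As a rough sketch of the scale route, assuming every column of metabolites is numeric (the toy data below is invented for illustration): scale centers each column on its mean and divides by its standard deviation, returning a matrix, so it is converted back to a data frame here.

```r
# Toy stand-in for the asker's metabolites data frame (values are invented).
metabolites <- data.frame(
  m1 = c(1, 2, 3, 4),
  m2 = c(10, 20, 40, 80)
)

# scale() standardizes every column; as.data.frame() restores the data frame.
standardized <- as.data.frame(scale(metabolites))

# Check: each column should have mean ~0 and standard deviation ~1.
print(colMeans(standardized))
print(sapply(standardized, sd))
```

If the data frame also holds non-numeric columns, those would need to be excluded before calling scale.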

Finding the mean of a column

Using R, I am trying to find the mean of a column but I can't seem to get it to work. This is my code:
mean(data_frame$column, na.rm = TRUE)
When I run it, it just gives me an error message: argument is not numeric or logical: returning NA. I've also tried using colMeans, but it just gives another error message: 'x' must be an array of at least two dimensions. What am I doing wrong?

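I think you should clean your data first, then calculate your mean. That error message means the column is not stored as numeric (for example, it may have been read in as character or factor), so it also needs converting.
# remove rows where the value is NA or an empty string
data_frame <- data_frame[!(is.na(data_frame$column) | data_frame$column==""),]
# convert to numeric and calculate the mean (stored as column_mean so the name mean is not reused)
column_mean <- mean(as.numeric(as.character(data_frame[["column"]])))
Let me know if this works for you.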
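Here is a small self-contained sketch of that diagnosis, assuming the column was read in as text; the data values are made up. Checking is.numeric first shows whether conversion is needed at all.

```r
# Invented example: a column of numbers that was read in as character.
data_frame <- data.frame(
  column = c("1.5", "2.5", "", NA, "4"),
  stringsAsFactors = FALSE
)

# FALSE here explains the "argument is not numeric or logical" warning.
print(is.numeric(data_frame$column))

# as.character() first makes this safe for factors as well as strings;
# empty strings become NA and are then dropped by na.rm = TRUE.
values <- as.numeric(as.character(data_frame$column))
column_mean <- mean(values, na.rm = TRUE)
print(column_mean)
```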

Subset data.table based on value in column of type list

So I have this case currently of a data.table with one column of type list.
This list can contain different values, NULL among other possible values.
I tried to subset the data.table to keep only rows for which this column has the value NULL.
Behold... my attempts below (for the example I named the column "ColofTypeList"):
DT[is.null(ColofTypeList)]
It returns an empty data.table.
Then I tried:
DT[ColofTypeList == NULL]
It returns the following error (I expected an error):
Error in .prepareFastSubset(isub = isub, x = x, enclos = parent.frame(), :
RHS of == is length 0 which is not 1 or nrow (96). For robustness, no recycling is allowed (other than of length 1 RHS). Consider %in% instead.
(To be precise: my original data.table contains 96 rows, which is why the error message says "which is not 1 or nrow (96)". The number of rows is not the point.)
Then I tried this:
DT[ColofTypeList == list(NULL)]
It returns the following error:
Error: comparison of these types is not implemented
I also tried giving a list of the same length as the column, and got the same error.
So my question is simple: what is the correct data.table way to subset the rows for which elements of this "ColofTypeList" are NULL?
EDIT: here is a reproducible example
DT<-data.table(Random_stuff=c(1:9),ColofTypeList=rep(list(NULL,"hello",NULL),3))
Have fun!

If it is a list, we can loop through the list and apply is.null to each element to get a logical vector:
DT[unlist(lapply(ColofTypeList, is.null))]
#   ColofTypeList anotherCol
#1:                        3
Another option is lengths:
DT[lengths(ColofTypeList)==0]
Data:
DT <- data.table(ColofTypeList = list(0, 1:5, NULL, NA), anotherCol = 1:4)
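To see why the first attempt returned an empty table, here is a base R sketch that uses a plain list as a stand-in for the list column (no data.table needed; the variable names are only illustrative). is.null tests the column object as a whole and yields a single FALSE, whereas the per-element checks yield one value per row.

```r
# Plain list standing in for the ColofTypeList column.
col <- list(0, 1:5, NULL, NA)

# The whole list is not NULL, so this is a single FALSE.
whole_column <- is.null(col)

# One logical per element, suitable for row subsetting.
per_element <- vapply(col, is.null, logical(1))

# NULL elements have length 0, so lengths() gives the same mask.
by_length <- lengths(col) == 0

print(whole_column)
print(per_element)
print(by_length)
```

vapply is used here instead of unlist(lapply(...)) only because it guarantees a logical vector of the right length; both give the same mask for this input.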

I have found another way that is also quite nice:
DT[lapply(ColofTypeList, is.null)==TRUE]
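It is also important to mention that using isTRUE() doesn't work here, because isTRUE() returns a single TRUE/FALSE for its whole argument rather than one value per row.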

Error in seq.default(1, 1, length.out = nrow(x)) : argument 'length.out' must be of length 1

I am trying to make a simple function that finds outliers and marks the corresponding observation as valid.obs=1 if it is not an outlier, or valid.obs=0 if it is indeed an outlier.
For example, for the variable "income", the outliers will be identified based on the following formula: if
income>=(99percentile(income)+standard_deviation(income)), then it is an outlier.
If income<(99percentile(income)+standard_deviation(income)), then it is not an outlier.
rem= function(x){
u=quantile(x,probs=0.99,na.rm=TRUE) #calculating the 99th percentile
s=sapply(x,sd,na.rm=TRUE) #calculating the standard deviation
uc=u+s
v=seq(1,1,length.out = nrow(x))
v[x>=uc]=0
v[x<uc]=1
x$valid.obs=v
return(x)
}
I go on to apply this function to a single column of a dataframe. The dataframe has 132 variables with 5000 entries. I choose the variable "income":
apply(data["income"],2,rem)
It then shows the error:
Error in seq.default(1, 1, length.out = nrow(x)) :
argument 'length.out' must be of length 1
Outside the function "rem", the following code works just fine:
nrow(data["income"])
[1] 5000
I am new to R and there aren't many functions in my armoury yet. The objective of this function is very simple. Please let me know why this error has crept in and if there is an easier way to go about this.

Use
v = rep(1, length.out = length(x))
apply iterates over the "margins" (rows or columns) of a data frame and passes each column to FUN as a named vector. A vector has a length but no row count, so nrow(x) returns NULL.
i.e. inside rem you are effectively calling
> nrow(c(1,2,3))
NULL
A few other things not directly related to your error:
For the same reason as above, there is no need to call sd inside sapply. Just call it normally on the vector.
s=sd(x,na.rm=TRUE) #calculating the standard deviation
You can also simplify three lines (and remove your initial problem entirely) by using
v=as.numeric(x<uc)
This will create a logical vector (automatically the same length as x) with TRUE/FALSE values based on x<uc. To get your 0s and 1s, just coerce the logical values with as.numeric.
Finally, if all you need to do is add one column to data based on the values in income, return v instead and call the function like so:
data$valid.obs <- rem(data$income)
Your function will now return a vector, which can be added to data under the new name valid.obs.
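Putting those suggestions together, a revised rem might look roughly like the sketch below. The income values are invented for illustration, and unname is an extra step added here only to drop the "99%" label that quantile attaches to its result.

```r
# Revised rem(): takes a numeric vector, returns 1 for valid, 0 for outlier.
rem <- function(x) {
  u <- quantile(x, probs = 0.99, na.rm = TRUE)  # 99th percentile
  s <- sd(x, na.rm = TRUE)                      # standard deviation
  uc <- unname(u + s)                           # outlier cutoff
  as.numeric(x < uc)                            # TRUE/FALSE coerced to 1/0
}

# Invented stand-in for the asker's data: 99 ordinary incomes, one extreme.
data <- data.frame(income = c(rep(100, 99), 10000))
data$valid.obs <- rem(data$income)

print(table(data$valid.obs))
```

Note that any NA in income stays NA in valid.obs, since the comparison x < uc is NA for missing values.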

Using tapply on data with NAs

I have a data column (Percent.Plant.Parasites) that has some NAs. I want to take the mean of this data sorted by factor "Stage" (ie stage1 Mean=x, stage2 Mean=y, etc). I tried doing this using
tapply(rawdata$Percent.Plant.Parasites, rawdata$Stage, mean)
However, I get NAs because there are NAs in the data. I don't believe there is an na.rm option for tapply (is there?), so I tried to calculate the mean of each individual stage factor using:
mean(subset(rawdata,subset=Stage=="stage1")$Percent.Plant.Parasites, na.rm=TRUE)
to no avail. Instead I got the error:
In mean.default(subset(rawdata, subset = Stage == "Kax")$Percent.Plant.Parasites, :
argument is not numeric or logical: returning NA
However, when I do:
typeof(subset(rawdata,subset=Stage=="Kax")$Percent.Plant.Parasites)
I get integer
Any ideas where I'm going wrong?
Thanks.

Why not just create a new function, call it mean_NA, that simply removes the NAs before calculating the mean and then use that function in tapply? Something like:
mean_NA <- function(v) {
  avg <- mean(v, na.rm = TRUE)
  return(avg)
}
As was commented, make sure that the data you're taking the mean of is numeric/integer and the INDEX is factor(groups). You would use the newly created function like this:
tapply(X = rawdata$Percent.Plant.Parasites, INDEX = rawdata$Stage, mean_NA)
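As an alternative sketch: tapply forwards any extra arguments to FUN, so na.rm = TRUE can be passed directly without a wrapper. The toy rawdata below is made up. The second part illustrates one possible explanation for the asker's confusing typeof result: a factor reports "integer" from typeof, yet mean still rejects it, so is.numeric is the more reliable check.

```r
# Invented stand-in for the asker's rawdata.
rawdata <- data.frame(
  Stage = factor(c("stage1", "stage1", "stage2", "stage2", "stage2")),
  Percent.Plant.Parasites = c(10, NA, 20, 40, NA)
)

# Extra arguments after FUN are handed on to mean().
stage_means <- tapply(rawdata$Percent.Plant.Parasites, rawdata$Stage,
                      mean, na.rm = TRUE)
print(stage_means)

# A factor is stored as integer codes, so typeof() can be misleading.
f <- factor(c("1", "2"))
print(typeof(f))      # "integer"
print(is.numeric(f))  # FALSE
```

If the data column does turn out to be a factor, converting it with as.numeric(as.character(...)) before calling tapply would be the usual remedy.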
