Beginner: how can I repeat this function? - r

I need RStudio for analysing some data, but I haven't used it for four years.
Now I've run into a problem and don't know how to solve it. I want to calculate the variance of several columns together in every row. With some experimentation I've found this:
var(as.numeric(data[1,8:33]))
and I get: 1.046154
As far as I know this should be right: it gives me the variance of columns 8 to 33 in the row for the first person. It also works for any other row:
var(as.numeric(data[5,8:33])) => 1.046154
Now I could of course run the same command for every row individually, but I have 111 participants and several surveys. I tried to find a way to repeat the same command for every row, but it didn't work.
How can I use the command from above and repeat it to all 111 participants?

Without the data it is difficult to help, but I created some dummy data using rnorm. You can use apply to obtain a vector containing the variance for each row. Since it appears that your data is in character format and not numeric, I created a simple function to automatically transform it and calculate the variance.
set.seed(20)
data <- matrix(as.character(rnorm(3663)),
               ncol = 33,
               nrow = 111)
## basic function: coerce to numeric, then take the variance
obtain_variance_from_character <- function(x) {
  return(var(as.numeric(x)))
}
## calculate variances by row (apply works directly on the matrix)
variances <- apply(data, MARGIN = 1, FUN = obtain_variance_from_character)
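Applied to the asker's own data, where rows are participants and the items sit in columns 8 to 33, the call would look like this (a sketch; the column range and the name data are taken from the question):
variances <- apply(data[, 8:33], MARGIN = 1, FUN = obtain_variance_from_character)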

Related

Calculate ratios of all column combinations from a dataframe

I have a CSV file imported as df in R; the dimension of this df is 18x11. I want to calculate all possible ratios between the columns. Can you please help me with this? I understand that either a for loop or a vectorized function will do the job. The row names will remain the same, while column-name combinations can be merged using paste. However, I don't know how to execute this. I did it in Excel, as it is still a small data set, but a larger one would make it tedious and error-prone in Excel, so I would like to try it in R.
It would be a great help indeed. Thanks. Let's say below is a data frame as a subset of my data.
dfn = data.frame(replicate(18,sample(100:1000,15,rep=TRUE)))
If you do:
do.call("cbind", lapply(seq_along(dfn), function(y) apply(dfn, 2, function(x) dfn[[y]]/x)))
You will get a 15 x 324 matrix: the first 18 columns are the first column divided by each column in turn, the next 18 are the second column divided by each column, and so on.
You can keep track of them by labelling the columns with the following names:
apply(expand.grid(names(dfn), names(dfn)), 1, paste, collapse = " / ")
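One caveat (not from the original answer): expand.grid() varies its first argument fastest, while the lapply/apply nesting above holds the numerator fixed within each block of 18, so pasting the grid in its natural order swaps numerator and denominator in the labels. A sketch of one way to keep them aligned:
ratios <- do.call("cbind", lapply(seq_along(dfn), function(y) apply(dfn, 2, function(x) dfn[[y]]/x)))
## paste the denominator column second so the label matches dfn[[y]]/x
colnames(ratios) <- apply(expand.grid(names(dfn), names(dfn)), 1, function(v) paste(v[2], v[1], sep = " / "))
head(colnames(ratios)) ## "X1 / X1" "X1 / X2" "X1 / X3" ...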

Creating a new column with random dates [duplicate]

I am trying to sample a data frame from a given data frame such that there are enough samples from each of the levels of a variable.
This can be achieved by splitting the data frame by the levels and sampling from each of those.
I thought ddply (data-frame to data-frame) would do it for me.
Taking a minimal example:
set.seed(1)
data1 <-data.frame(a=sample(c('B0','B1','B2'),100,replace=TRUE),b=rnorm(100),c=runif(100))
> summary(data1$a)
B0 B1 B2
30 32 38
When I enter the following command to perform the sampling
data2 <- ddply(data1,c('a'),function(x) sample(x,20,replace=FALSE))
I get the following error:
Error in `[.data.frame`(x, .Internal(sample(length(x), size, replace, :
  cannot take a sample larger than the population when 'replace = FALSE'
This error occurs because the x inside the ddply function is not a vector but a data frame, so sample tries to sample from its columns.
Does anyone have an idea of how to achieve this sampling?
I know one way is to not use ddply and just do (1) segregation, (2) sampling, and (3) collation in three separate steps. But I was wondering whether there is some way with base or plyr functions...
Thank you for your help...
I think what you want is to subset the data frame passed in x using sample:
ddply(data1,.(a),function(x) x[sample(nrow(x),20,replace = FALSE),])
But, of course, you still need to take care that the size of the sample for each piece (in this case 20) is at least as big as the smallest subset of your data based on the levels of a.
It would seem that if you want to sample a category that has fewer than 20 rows, you'd need replace=TRUE...
This might do the trick:
ddply(data1,'a',function(x) x[sample.int(NROW(x),20,replace=TRUE),])
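For completeness, here is a base-R sketch of the segregate/sample/collate route the asker mentions, with the per-group size capped at the group size so small levels are returned whole instead of raising the error above:
## split by level, sample up to 20 rows per piece, then recombine
sampled <- do.call(rbind, lapply(split(data1, data1$a), function(x) x[sample.int(nrow(x), min(20, nrow(x)), replace = FALSE), ]))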

How do I generate row-specific means in a data frame?

I'm looking to generate means of ratings as a new variable/column in a data frame. Every method I've tried so far either generates columns that show the mean of the entire dataset (for the chosen items) or doesn't generate means at all. Using the rowMeans function on its own doesn't work, as I'm not looking for a mean of every value in a row, just a mean that reflects the chosen values in a given row. So for example, I'm looking for the mean of 10 ratings:
fun <- mean(T1.1,T2.1,T3.1,T4.1,T5.1,T6.1,T7.1,T8.1,T9.1,T10.1, trim = 0, na.rm = TRUE)
I want a different mean printed for every row, because each row represents a different set of observations (a different subject, in my case). The issues I'm looking to correct are twofold: (1) the call above generates only one mean, the mean of all values for each of the 10 variables, and (2) the resulting vector is not part of the data frame. I tried to generate a new column in the data frame by using "exp$fun", but that just creates a column whose every value (for every row) is the grand mean. Could anyone advise as to how to program this sort of row-based mean? I'm sure it's simple enough, but I haven't been able to figure it out through Googling or trawling StackOverflow.
Thanks!
It's hard to give an exact answer without a reproducible example, but have you tried subsetting your dataset to include only the 10 columns from which you'd like to derive your means, and then using an apply statement? Something along the lines of apply(df, 1, mean), where the first argument is your data frame, the second specifies whether to apply the function by rows (1) or columns (2), and the third is the function you wish to apply.
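As a concrete sketch (the column names T1.1 to T10.1 and the data frame name exp are taken from the question; na.rm = TRUE mirrors the asker's call):
rating_cols <- c("T1.1","T2.1","T3.1","T4.1","T5.1","T6.1","T7.1","T8.1","T9.1","T10.1")
## one mean per row, stored as a new column of the data frame
exp$rating_mean <- rowMeans(exp[, rating_cols], na.rm = TRUE)
## equivalently, with apply()
exp$rating_mean <- apply(exp[, rating_cols], 1, mean, na.rm = TRUE)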

Omitting NA in specific rows when analyzing data from 2 columns of a very large dataframe

I am very new to R and I am struggling to understand how to omit NA values in a specific way.
I have a large data frame with several columns (up to 40) and rows (up to 200ish). I want to use data from the columns to do simple stats (wilcox.test, boxplot, etc.): one column holds a continuous variable (V1), while the other holds a binary variable (V2; 0 or 1) that divides the data into 2 groups. I want to do this for the continuous variable against different binary variables, which are unrelated. I organized this data in Excel, saved it as CSV, and am using RStudio.
All these columns have interspersed NA values, and when I use na.omit it simply removes every row where an NA value is present anywhere, which throws away an awful lot of data. Is there any simple solution to this? I have seen answers to similar topics, but none seems to be quite what I need.
Many thanks for any answer. Again, I am a baby-level newbie to R and may have overlooked something in other topics!
If I understand correctly, you want to apply the function to a pair of columns each time:
wilcox.test(V1,V2)
wilcox.test(V1,V3)...
where each pair has no missing values. I would do something like this:
## use complete.cases to keep only the rows where both columns
## of the selected pair are non-missing
apply_clean <- function(x, y) {
  ok <- complete.cases(x, y)
  wilcox.test(x[ok], y[ok])
}
## apply this function to all columns except the continuous one
lapply(subset(dat, select = -V1), apply_clean, y = dat$V1)
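A quick sketch with made-up data to show the call end to end (dat, V1, V2 and V3 are assumed names based on the question):
set.seed(7)
dat <- data.frame(V1 = replace(rnorm(30), sample(30, 3), NA),
                  V2 = replace(rbinom(30, 1, 0.5), sample(30, 4), NA),
                  V3 = replace(rbinom(30, 1, 0.5), sample(30, 2), NA))
results <- lapply(subset(dat, select = -V1), apply_clean, y = dat$V1)
results$V2 ## tested on only the rows complete for the V1/V2 pair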
You can manipulate the data.frame to omit based on any rules you like. For example:
dirty.frame <- data.frame(col1 = c(1,2,3,4,5,6,7,NA,9,10), col2 = c(10, 9, 8, 7,6,5,4,3,2,1))
cleaned.frame <- dirty.frame[!is.na(dirty.frame$col1),]
This code uses is.na() to test, for each row, whether the value in a specific column is NA. The ! means "not", so the subsetting keeps only the rows where that column is not NA.
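The same idea extends to a pair of columns with complete.cases (a sketch reusing the dummy names above): a row is kept if both chosen columns are non-missing, and NAs elsewhere in the frame are left untouched.
pair.frame <- dirty.frame[complete.cases(dirty.frame[, c("col1", "col2")]), ]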

Calculating moving average with different codes and different sizes

I have a data frame that contains data for different observations, where the observations are grouped by a unique code. As a reproducible example, here is what simulated data looks like:
v <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6)
mat1 <- matrix(runif(200),40)
mat1 <- cbind(v,mat1)
mat1 <- as.data.frame(mat1)
names(mat1) <- c('code','x1','x2','x3','x4','x5')
unq <- unique(mat1$code)
What I would like to do is calculate an average for each observation, based on the two previous and two following observations (you can think of this as a time series). So for example:
mat1$x1[3] = mean(mat1$x1[1:5])
mat1$x1[4] = mean(mat1$x1[2:6])
and so on. I am able to do the calculation using a particular code (for example when mat1$code==1):
K <- data.frame(code = mat1$code, x1 = rep(0,40), x2 = rep(0,40),
                x3 = rep(0,40), x4 = rep(0,40), x5 = rep(0,40))
for (i in 3:(nrow(mat1) - 2)) {
  if (mat1$code[i] == unq[1]) {
    ## parentheses matter here: i-2:i+2 would parse as i - (2:i) + 2
    K[i, 2] <- mean(mat1[(i - 2):(i + 2), 2])
  }
}
But there are two things that I couldn't figure out:
(1) Since the actual dataset is much larger than the simulated one, how can I dynamically go through all the unique codes and do the calculation, noting that the first and last two observations of each unique code should be zero (and I will eventually get rid of them).
(2) The number of observations differs between unique codes, and some codes have fewer than 4 observations, in which case the calculation can't be done for that code at all!
Any help is highly appreciated.
Thank you
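Not part of the original thread, but a sketch of one base-R approach that addresses both points: wrap the centered five-point window in a helper that leaves zeros at the edges (and for codes with too few rows), then apply it within each code using ave().
centered_mean5 <- function(x) {
  n <- length(x)
  out <- numeric(n) ## edge rows and too-short groups stay 0
  if (n >= 5) {
    for (i in 3:(n - 2)) out[i] <- mean(x[(i - 2):(i + 2)])
  }
  out
}
## apply within each code, column by column
K <- mat1
K[, -1] <- lapply(mat1[, -1], function(col) ave(col, mat1$code, FUN = centered_mean5))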
