Calculate mean value of subsets and store them in a vector for further analysis - r

Hullo, I've been working on a dataset for a while now, but I'm kind of stuck. One question/answer here was already helpful, but I need to calculate the mean not for a single column, but for sixty.
My dataset is basically this:
> data[c(1:5, 111:116), c(1:6, 85:87)]
plotcode block plot subsample year month Alo.pra Ant.odo Arr.ela
91 B1A01 B1 A01 1 2003 May 0 9 0
92 B1A02 B1 A02 1 2003 May 38 0 0
93 B1A03 B1 A03 1 2003 May 0 0 0
94 B1A04 B1 A04 1 2003 May 0 0 0
95 B1A05 B1 A05 1 2003 May 0 0 0
214 B2A16 B2 A16 2 2003 May 0 0 0
215 B2A17 B2 A17 2 2003 May 0 0 0
216 B2A18 B2 A18 2 2003 May 486 0 0
217 B2A19 B2 A19 2 2003 May 0 0 0
218 B2A20 B2 A20 2 2003 May 0 0 0
219 B2A21 B2 A21 2 2003 May 0 0 0
The first few columns are general data about the data point. Each plot has up to 4 subsamples. Columns 85:144 hold the data I want to calculate the means of.
I used this command:
tapply(data2003[,85] , as.factor(data2003$plotcode), mean, na.rm=T)
But like I said, I need to calculate the mean sixty times, for columns 85:144.
My idea was to use a for-loop.
for (i in 85:144)
{
  temp <- tapply(data2003[, i], data2003$plotcode, mean, na.rm = TRUE)
  mean.mass.2003 <- rbind(mean.mass.2003, temp)
}
But that doesn't work: I get multiple error messages saying "number of columns of result is not a multiple of vector length (arg 2)".
What I basically want is a table in which the columns represent the species, with the rows as the plotcode and the actual entries in the fields being the respective means.

After some figuring and fiddling, and with some help, I arrived at something that works as I wanted. I know it's a kind of convoluted approach, but I only just started with R, so I do like to understand what I code:
data.plots <- matrix(NA, 88, 60) ## a new, empty matrix (88 plots x 60 species) we'll fill with the loop
for (i in 85:144) # these numbers because that's where our relevant data is
{
  temp <- tapply(data2003[, i], data2003$plotcode, mean, na.rm = TRUE) # what tapply does in this instance: it calculates the mean of the i-th column of data2003 over every group of rows sharing the same plotcode, ignoring NAs. temp is a named vector with one mean per plot.
  data.plots[, i - 84] <- as.numeric(temp) # shunts the vector of per-plot means we just calculated into the (i-84)-th column of data.plots.
}
colnames(data.plots) <- colnames(data2003[85:144])
rownames(data.plots) <- as.data.frame(table(data2003$plotcode))[, 1] # the second part is basically a count() function, returning the unique entries found in the first column and their frequencies in the second.
This works. tapply returns the mean biomass per species as a named numeric vector (one value for every unique entry in data2003$plotcode), and the loop then writes these vectors one after the other into the columns of the target matrix data.plots.
After naming the rows and columns of data.plots I can work with it without always having to remember each name.
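For reference, base R can also collapse all sixty columns in one call. A minimal sketch, assuming the same data2003 layout as above: aggregate() computes the per-plot mean of every listed column at once and returns a data frame with one row per plotcode and one column per species.
mean.mass.2003 <- aggregate(data2003[, 85:144],
                            by = list(plotcode = data2003$plotcode),
                            FUN = mean, na.rm = TRUE)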

Related

Cannot get completed dataset using imputeMCA

I use the missMDA package to fill in multiple categorical columns. However, I cannot get any result from these two functions: estim_ncpMCA(test_fill) and imputeMCA(test_fill). The program keeps running without showing a progress bar or returning any results.
This is a sample from the dataset.
Hybrid G1 G5 G8 G9 G10
P1000:P2030 0 -1 0 1 0
P1006:P1384 0 0 0 0 1
P1006:P1401 0 NA NA NA 1
P1006:P1412 0 0 0 0 1
P1006:P1594 0 0 0 0 1
P1013:P1517 0 0 0 0 1
I am working on a genetics project in R. In this dataset, there are 497 rows and 11,226 columns. Every row is a genetic marker series for a particular hybrid, and every column is a genetic marker ("G1", "G2", etc.) with values 1, 0, -1 and NA. There are 746,433 missing values in total, and I am trying to fill them in with imputeMCA.
I also made some transformations on test_fill before running imputeMCA.
test_fill = as.matrix(test_fill)
test_fill[, -1] <- lapply(test_fill[, -1], as.numeric)
I am wondering whether this is a result of having too many columns in my dataset, and whether I need to transpose my rows and columns.
I don't know if you have found your answer yet, but I think your function calls don't run because of your first column, which seems to be the label of the individuals. You can specify that it should not be taken into the analysis:
estim_ncpMCA(test_fill[, 2:11226], ncp.max = 5) # estimate the number of dimensions to use
imputeMCA(test_fill[, 2:11226], ncp = X) # X = the ncp value estimated above
I hope this can help.
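A minimal sketch of the full workflow, under the assumption that test_fill stays a data frame with the hybrid labels in column 1 (note that imputeMCA works on categorical data, so the markers are coded as factors rather than converted to numeric):
library(missMDA)
markers <- as.data.frame(lapply(test_fill[, -1], factor)) # MCA expects factors, not numerics
rownames(markers) <- test_fill[[1]] # keep the hybrid labels as row names
ncp <- estim_ncpMCA(markers, ncp.max = 5)$ncp # estimated number of dimensions
completed <- imputeMCA(markers, ncp = ncp)$completeObs # the completed dataset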

How do I delete columns from a matrix based on a list in R?

I have a very large matrix that looks like this:
AGAG AAGAA AGTG AGAT AAGAT AGTT
1001 14691 0 0 0 0 5
1002 13 12 0 5831 20473 4
1003 0 5831 20473 0 0 0
1004 0 7936 7936 7936 0 0
1005 16066 0 0 24 2 2
There are >8000 columns. I need to delete many (~3000) of these columns by column name. I have abbreviated the column names here (they are genome sequences). Obviously, I can't type that many individually.
I made a separate table that has the column names I want to delete. For example:
AGAG
AGTG
AGTT
AGAT
So far I have tried subset and %in%, to no avail.
Here's an example with the iris dataset:
col_to_rm <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
col_to_keep <- setdiff(colnames(iris), col_to_rm)
iris[, col_to_keep]
iris is a dataframe, but it will work as well if you have a matrix.
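Applied to the original problem, it might look like this (a sketch: mat stands for your large matrix, and cols_to_rm.txt is a hypothetical file with one column name to delete per line):
col_to_rm <- scan("cols_to_rm.txt", what = character()) # read the names to drop
mat_keep <- mat[, !colnames(mat) %in% col_to_rm] # keep every column not in the list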
You should probably also be thinking about whether that data structure (a matrix) is the best way to store your data: you have very many columns...

Reverse cumsum with breaks with non-sequential numbers

Looking to fill a matrix with a reverse cumsum. There are multiple breaks that must be maintained.
I have provided a sample matrix of what I want to accomplish. The first column is the data; the second column is what I want. You will see that column 2 is updated to reflect the number of items that are left. Where there are 0s, the previous number must be carried through.
update <- matrix(c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3,
                   rep(10, 4), 9, 8, 6, rep(6, 2), 5, 2), ncol = 2)
I have tried multiple ways to create a sequence or a loop, using numerous packages (e.g. zoo). What makes this difficult is that the numbers in column 1 can be anywhere in 0, 1, ..., X, but are always less than column 2.
Any help or tips would be appreciated.
EDIT: Column 2 starts with a given value, which can represent any starting quantity (e.g. inventory at the beginning of a month). Column 1 then represents "purchases" made; thus, column 2 should reflect the total number of remaining items available.
The following will report the purchase and inventory balance as described:
starting_inventory <- 100
df <- data.frame(purchases = c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3))
df$cum_purchases <- cumsum(df$purchases) # running total of purchases so far
df$remaining_inventory <- starting_inventory - df$cum_purchases # stock left after each row
Result:
purchases cum_purchases remaining_inventory
1 0 0 100
2 0 0 100
3 0 0 100
4 0 0 100
5 1 1 99
6 1 2 98
7 2 4 96
8 0 4 96
9 0 4 96
10 1 5 95
11 3 8 92
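The same recipe reproduces the second column of the sample update matrix from the question, with 10 as the starting value:
all.equal(10 - cumsum(update[, 1]), update[, 2])
# [1] TRUE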

subset all columns in a data frame less than a certain value in R

I have a dataframe that contains 7 p-value variables.
I can't post it because it is private data but it looks like this:
>df
o m l c a aa ep
1.11E-09 4.43E-05 0.000001602 4.02E-88 1.10E-43 7.31E-05 0.00022168
8.57E-07 0.0005479 0.0001402 2.84E-44 4.97E-17 0.0008272 0.000443361
0.00001112 0.0005479 0.0007368 1.40E-39 3.17E-16 0.0008272 0.000665041
7.31E-05 0.0006228 0.0007368 4.59E-33 2.57E-13 0.0008272 0.000886721
8.17E-05 0.002307 0.0008453 4.58E-18 5.14E-12 0.0008336 0.001108402
Each column has values between 0 and 1.
I would like to subset the entire data frame by extracting all the values in each column less than 0.009 and making a new data frame. If I were to extract on this condition, the columns would have very different lengths; e.g. c has 290 values less than 0.009, o has 300, aa has 500, etc.
I've tried:
subset(df,c<0.009 & a<0.009 & l<0.009 & m<0.009& aa<0.009 & o<0.009)
When I do this I just end up with a very small number of rows, which isn't what I want; I want all the values in each column that fit the subset criterion.
I then want to take this data frame and bin it into p-value range groups by using something like the summary(cut()) function, but I am not sure how to do it.
So essentially I would like to have a final data frame that includes the number of values in each p-value bin for each variable:
                o#  m#  l#  c#  a#  aa#  ep#
0-0.000001     545  58  85  78  85   45  785
0.000001-0.001  54  77  57  57  74   56   58
0.001-0.002     54   7   5   5  98   75  865
An attempt:
sapply(df, function(x) table(cut(x[x < 0.009], c(0, 0.000001, 0.001, 0.002, Inf))))
# o m l c a aa ep
#(0,1e-06] 2 0 0 5 5 0 0
#(1e-06,0.001] 3 4 5 0 0 5 4
#(0.001,0.002] 0 0 0 0 0 0 1
#(0.002,Inf] 0 1 0 0 0 0 0
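Note that because each column keeps a different number of values below the cutoff, the raw filtered values cannot form a rectangular data frame; a list is the natural container for them (a small sketch):
filtered <- lapply(df, function(x) x[x < 0.009]) # one vector of p-values per variable
lengths(filtered) # how many values each column kept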

Include zero frequencies in frequency table for Likert data

I have a dataset with responses to a Likert item on a 9-point scale. I would like to create a frequency table (and barplot) of the data, but some values on the scale never occur in my dataset, so table() omits those values from the frequency table. I would like it instead to report those values with a frequency of 0. That is, given the following dataset
# Assume a 5-point Likert scale for ease of example
data <- c(1, 1, 2, 1, 4, 4, 5)
I would like to get the following frequency table without having to manually insert a column named 3 with the value 0.
1 2 3 4 5
3 1 0 2 1
I'm new to R, so maybe I've overlooked something basic, but I haven't come across a function or option that gives the desired result.
EDIT:
tabulate produces frequency tables, while table produces contingency tables. However, to get zero frequencies in a one-dimensional contingency table, as in the above example, the code below still works, of course.
This question provided the missing link. By converting the Likert item to a factor, and explicitly specifying the levels, levels with a frequency of 0 are still counted
data <- factor(data, levels = c(1:5))
table(data)
produces the desired output
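Since the question also asks about a barplot, the same factor trick carries over: barplot() accepts the table directly, and the never-occurring level 3 shows up as an empty, labelled bar.
barplot(table(data)) # zero-frequency levels appear as zero-height bars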
table produces a contingency table, while tabulate produces a frequency count that includes zeros:
tabulate(data)
# [1] 3 1 0 2 1
Another way (if you have integers starting from 1, though it is easily modified for other cases):
setNames(tabulate(data), 1:max(data)) # to make the output easier to read
# 1 2 3 4 5
# 3 1 0 2 1
If you want to quickly calculate the counts or proportions for multiple Likert items and get your output in a data.frame, you may like the function psych::response.frequencies from the psych package.
Let's create some data (note that there are no 8s or 9s):
df <- data.frame(item1 = sample(1:7, 2000, replace = TRUE),
                 item2 = sample(1:7, 2000, replace = TRUE),
                 item3 = sample(1:7, 2000, replace = TRUE))
If you want to calculate the proportion in each category
psych::response.frequencies(df, max = 1000, uniqueitems = 1:9)
you get the following:
1 2 3 4 5 6 7 8 9 miss
item1 0.1450 0.1435 0.139 0.1325 0.1380 0.1605 0.1415 0 0 0
item2 0.1535 0.1315 0.126 0.1505 0.1535 0.1400 0.1450 0 0 0
item3 0.1320 0.1505 0.132 0.1465 0.1425 0.1535 0.1430 0 0 0
If you want counts, you can multiply by the sample size:
psych::response.frequencies(df, max = 1000, uniqueitems = 1:9) * nrow(df)
You get the following:
1 2 3 4 5 6 7 8 9 miss
item1 290 287 278 265 276 321 283 0 0 0
item2 307 263 252 301 307 280 290 0 0 0
item3 264 301 264 293 285 307 286 0 0 0
A few notes:
the default max is 10, so if you have more than 10 response options you'll run into issues; otherwise, in your case and in many Likert-item cases, you could omit the max argument.
uniqueitems specifies the possible values. If all your values were present in at least one item, this would be inferred from the data.
I think the function only works with numeric data, so if you have your Likert categories coded as "Strongly disagree", etc., it won't work.
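For text-coded items, one workaround is the factor approach shown above applied column-wise; a sketch with hypothetical labels:
labs <- c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree")
df_chr <- data.frame(item1 = sample(labs[1:4], 20, replace = TRUE),
                     item2 = sample(labs, 20, replace = TRUE))
sapply(df_chr, function(x) table(factor(x, levels = labs))) # zero-inclusive counts per item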
