Maybe I missed something in how tax_glom works, but as I did not find any information here or elsewhere on the web, maybe someone here can help.
I have not provided data, but I can on request. Here is the code highlighting the issue:
colSums(CYANO %>% otu_table())
CYANO_gen <- CYANO %>%
  tax_glom(taxrank = "Genus")
colSums(CYANO_gen %>% otu_table())
CYANO is a phyloseq object that I wanted to agglomerate at the Genus rank, but I noticed that one sample (named 100) was missing from a plot. This led me to check where the issue arose. 7 samples out of 54 show discrepancies in their column sums, as shown in the last line of the attached image. Weird, isn't it?
Results given by the code above and 2 additional lines which highlight the importance of discrepancies and the fact that this is not always the case
Thanks, Guillaume
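The NArm argument of tax_glom is TRUE by default, which removes taxa that have NA at the chosen rank (here, Genus) before agglomerating. Their counts are dropped with them, which is why some sample totals shrink. To avoid losing these observations, set NArm = FALSE.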
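For what it's worth, here is a minimal base-R sketch of why the column sums change. It does not call phyloseq: the toy matrix, the genus vector and the "unassigned" label are all made up, and rowsum() only stands in for the agglomeration step. phyloseq's own grouping of NA taxa under NArm = FALSE may differ from this.

```r
# Toy OTU table: 4 taxa (rows) x 2 samples (columns). All values are invented.
otu <- matrix(c(5, 0,
                3, 2,
                0, 7,
                4, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("otu", 1:4), c("s1", "s2")))
genus <- c("Anabaena", "Anabaena", NA, "Nostoc")

# Analogue of NArm = TRUE: drop taxa with a missing genus, then sum by genus.
keep <- !is.na(genus)
glom_drop <- rowsum(otu[keep, , drop = FALSE], group = genus[keep])

# Analogue of NArm = FALSE: keep unassigned taxa under an explicit label.
genus_filled <- ifelse(is.na(genus), "unassigned", genus)
glom_keep <- rowsum(otu, group = genus_filled)

colSums(otu)        # s1 = 12, s2 = 10
colSums(glom_drop)  # s1 = 12, s2 = 3: s2 lost the counts of the NA-genus taxon
colSums(glom_keep)  # s1 = 12, s2 = 10: totals preserved
```

Only samples that contain reads from NA-genus taxa are affected, which would explain why the discrepancy shows up in some samples and not others.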
Cheers
I started with R a few weeks ago at university. We were given a problem to solve. However, I find that there are two answers that fit the question:
Verify that you created lo_heval correctly (incl. missing values). Store your verification in the object proof2.
So I think this is correct:
proof2 <- soep[1:100, c("heval", "lo_heval")]
But I think that this answer is also correct:
proof2 <- table(soep$heval, soep$lo_heval, useNA = "always")
Instead of having to decide for one answer, how do I combine them both into the object? I tried to use &, but I get an error. I may be using it wrong.
Prof. if you're seeing this, please don't fail me. I just can't decide between them.
Thanks in advance!
R lists can hold any arbitrary objects in them, so you could use
proof2 <- list(
  soep[1:100, c("heval", "lo_heval")],
  table(soep$heval, soep$lo_heval, useNA = "always")
)
However, to my mind 100 rows of two columns isn't proof: it leaves the reader to look through those rows and verify that things are right. (And what about the rows past 100? It's a decent spot check, but if there are more rows in the data, it is strong evidence rather than proof.) The table approach, on the other hand, seems succinct and effective.
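As a rough sketch of the list approach with named elements, so each check can be pulled out later: the soep data frame below is a toy stand-in, and the recoding rule for lo_heval is invented purely for illustration (the real rule is whatever the exercise specifies).

```r
# Toy stand-in for soep; the values and the recoding rule are hypothetical.
soep <- data.frame(heval = c(1, 2, 3, 4, 5, NA, 2, 5))
soep$lo_heval <- ifelse(soep$heval >= 4, 1, 0)  # NA in heval stays NA

proof2 <- list(
  spot_check = head(soep[, c("heval", "lo_heval")], 100),
  crosstab   = table(soep$heval, soep$lo_heval, useNA = "always")
)

# Access either piece by name.
proof2$crosstab

# If the recode is consistent, every heval level (including NA) lands in
# exactly one lo_heval column of the cross-tabulation.
all(rowSums(proof2$crosstab > 0) == 1)
```

Using head(..., 100) instead of soep[1:100, ] avoids padding with NA rows when the data has fewer than 100 rows.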
I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graphs to the numbers in the report, I noticed that they differed. On investigating further, I found that my cleaning code (which removes participants with duplicate IDs) now returns more rows, i.e. more participants with unique IDs, than before: I now have 730 participants, whereas previously there were 702. I don't know if this is due to package updates, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variables is a set number. Ideally it would be something like this, but of course I know that it's not going to work in this form:
iris %>%
filter_if(mean(.$Petal.Length) == 1.3)
I know that this was an incorrect attempt but I don't know any other way that I would try this, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.
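The arithmetic in the steps above can be sketched as follows. The two means are placeholder values chosen only so the code runs; substitute the mean from the old report and the mean from the current data. The extra_mean line is an addition of mine: it gives the average the 28 extra cases must have, which can serve as a quick sanity check before attempting the partitioning.

```r
# Placeholder values: replace with the reported mean and the current mean.
old_n <- 702
new_n <- 730
old_mean <- 3.40
new_mean <- 3.45

old_sum <- old_n * old_mean
new_sum <- new_n * new_mean

# Total contributed by the 28 extra cases, and their implied mean.
extra_sum  <- new_sum - old_sum
extra_mean <- extra_sum / (new_n - old_n)

# Share of the new total that comes from the extra and the old cases.
contributions <- c(extra = extra_sum / new_sum, old = old_sum / new_sum)
contributions
```

If extra_mean falls outside the possible range of the variable, the reported mean and the current data cannot both be right, and the partitioning step is not worth running.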
I am trying to (eventually) plot data by groups, using the prodlim function.
I'm adjusting and adapting code that someone else (not available for questions) has written, and I'm not very familiar with the prodlim library/function. There are definitely other ways to do what I'd like to, but I'm trying to keep it consistent with what the previous person did.
I have code that works, when dividing the data into 2 groups, but when I try to adjust for a 4 group situation, I get an error.
Of note, the data is coming over from SAS using StatTransfer, which has been working fine.
I am new to coding, but I have compared the dataframes I'm trying to work with. The second is just a subset of the first (where the code does work), with all the same variables, and both of the variables I'm trying to group by are integer values.
Hist(medpop$dz_time, medpop$dz_status) works just fine, so the problem must be with the prodlim function, and I haven't understood much of what I've looked up about it, sadly :/ But the documentation seems to indicate that it supports continuous or categorical variables, and it doesn't seem limited to binary ones either. None of the options seem applicable as I understand them.
this works:
M <- prodlim(Hist(dz_time, dz_status)~med, data=pop)
where med is a binary value =1 when a member of this population is taking it, and dz is a disease that some portion develop.
this does not:
(either of these get the error as below)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=medpop)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=pop, subset=pop$med==1)
medpop = the subset of the original population taking the med,
strength = categorical variable ("1","2","3","4")
For the line that does work, the next step is just plot(M), giving a plot with two lines, med==0 and med==1 (showing cumulative incidence of dz_status by dz_time).
For the other line, I get an error saying
Error in KernSmooth::dpik(cumtabx/N, kernel = "box") :
scale estimate is zero for input data
I don't know what that means or how to fix it.. :/
I'm working on a replication of the study for this particular data that you could find in this link, the data is named AProrok_AJPS.tab, please click on Download and then you can choose the RData format.
I want to remove all the rows whose value in a specific column is 1, so with this code:
df <- data[data$unknownleader!=1,]
After that, however, all the data becomes NA; it is basically all blank. I tried changing the column's type (integer, factor, etc.), but every attempt gave the same result. I am not sure what it is about this data file that causes the problem. Could anyone please investigate and show me a possible way to fix it?
OK, so thanks to @PaulHiemstra for pointing out that the problem arose from the NA values in the dataset. Based on this thread, I came up with a solution:
First, replace all the NA values in the unknownleader column with 0:
df$unknownleader <- replace(df$unknownleader, is.na(df$unknownleader), 0)
Then remove the rows as described in the question:
df <- df[df$unknownleader==0, ]
Note that this works because the unknownleader variable happens to be binary, so it still makes sense to replace NA with 0. For other datasets, some adjustment might be needed.
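For reference, a small sketch of why the NA rows appear and two ways to filter without overwriting the NA values first. The data frame here is a made-up stand-in for the real file; only the column name unknownleader comes from the question.

```r
# Made-up stand-in for the real data.
data <- data.frame(id = 1:5, unknownleader = c(0, 1, NA, 0, 1))

# NA != 1 evaluates to NA, and indexing rows with NA returns an all-NA row.
with_na_rows <- data[data$unknownleader != 1, ]

# Option 1: treat NA as "not 1" and keep those rows, with NA left intact.
keep_na <- data[is.na(data$unknownleader) | data$unknownleader != 1, ]

# Option 2: which() returns only the positions that are TRUE, so rows
# with NA in unknownleader are dropped.
drop_na <- data[which(data$unknownleader != 1), ]

with_na_rows  # 3 rows, one of them entirely NA
keep_na       # ids 1, 3, 4
drop_na       # ids 1, 4
```

Option 1 gives the same rows as the replace-then-filter approach but leaves the original NA values in place, which matters if the column is used again later.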
There are several threads asking how to check if all elements in a vector are the same. That is not my issue.
I have been using the all function in R with no issues. I wanted to assess if all elements of a column mydataframe$colA are the same as in mydataframe$colB:
if(all(mydataframe$colA == mydataframe$colB) == FALSE) {...}
However today I started to see NA as being returned as result of the all function, instead of a boolean. I have tried other ways to find if all elements are the same or not. For example:
table(mydataframe$colA == mydataframe$colB) gives me only TRUE
So indeed all my values from column A are the same as the ones in column B.
What can be wrong here? I stress that my script has been working fine, and even today I ran these same lines 8 times with data from different samples with no problem. All data and all samples are supposed to be in exactly the same format.
Thanks in advance!
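One possible explanation, sketched on made-up vectors rather than the real columns: if either column contains an NA, the comparison yields NA at that position, all() returns NA when there is no FALSE, and table() hides NA by default, so it shows only TRUE. Whether this is what happened in the actual data is an assumption that would need checking with anyNA().

```r
# Made-up columns; colA has one missing value.
colA <- c(1, 2, NA, 4)
colB <- c(1, 2, 3, 4)
cmp <- colA == colB           # TRUE TRUE NA TRUE

all(cmp)                      # NA: no FALSE, but one value is unknown
table(cmp)                    # shows only TRUE, because NA is dropped
table(cmp, useNA = "ifany")   # shows the NA count as well

# Two ways to get a plain TRUE/FALSE, depending on what NA should mean.
isTRUE(all(cmp))              # FALSE: treats an unknown result as "not all equal"
all(cmp, na.rm = TRUE)        # TRUE: ignores positions that could not be compared
```

In an if() condition, isTRUE(all(...)) is the safer form, since if (NA) stops with an error.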