R: Summing frequency in a list

Edit: Packages used are plyr and vegan; R is the most up-to-date version.
My base data is this:
X1 = c('Archea01', 'Bacteria01', 'Bacteria02')
Sample1 = c(0.2,NA,NA)
Sample2 = c(0, 0.001, NA)
Sample3 = c(0.04, NA, NA)
df = data.frame(X1,Sample1,Sample2,Sample3)
df
X1 Sample1 Sample2 Sample3
1 Archea01 0.2 0.000 0.04
2 Bacteria01 NA 0.001 NA
3 Bacteria02 NA NA NA
Data purposefully made with NAs, to reflect real data.
My goal is to sum the frequency of bacterial/archeal occurrence in each sample, which would ideally create this type of data frame:
Sample1 Sample2 Sample3
23 11 12
I have managed to create a list of frequencies:
dfFreq <- apply(df, 2, count)
Although this looks good, it's not quite what I want:
head(dfFreq)[2]
$Sample2
x freq
1 0.000 23
2 0.001 5
3 <NA> 50
The next logical step would be to convert the list into a dataframe and sum frequency (or vice versa), but my code has not worked. I have tried:
df.data <- ldply (dfFreq, data.frame)
dfSUM <- apply(dfFreq, 2, sum)
Trying to sum the list simply hasn't worked (unsurprisingly). Regarding transforming into a dataframe, I have looked all over Stack Overflow and have seen a lot suggesting the above or lapply, but the data frame that is created from the code suggested is:
x freq
Archea01 1
Bacteria01 1
etc etc
Which is not what I want.
Any thoughts about how to either A) sum frequency and then convert into a data frame like the one I want, or B) convert the list into a sensible data frame whose frequency column can be summed? I think A is the only way I can get to the point I want, but any thoughts about this would be greatly appreciated.
Edit 2.0:
Ryan Morton suggested the following code:
require(dplyr)
dfBound <- rbind(dfFreq)
Which has resulted in this data frame:
X1 Sample1
dfFreq list(x = 1:1885, freq = c(1, 1, 1) list(x = c(1, 2, 3)
Although this certainly seems closer to the solution, I notice that each list either follows the format of X1 or the format of Sample1 (x = c(1, 2, 3), etc.), which indicates that something went wrong in the process of binding the lists.
Any ideas of why this may not be working, and what solution there may be for summing the frequency found within the list?
Thanks very much.

Update
I figured out how to sum my original frequency table and convert it into the data frame I was hoping for. Thanks to Ryan Morton for pointing me in the right direction and providing code.
dfNARemoved <- lapply(dfFreq, function(x) transform(x[-nrow(x), ])) # removing the useless NA row from each count table
dfFreqxRemoved <- lapply(dfNARemoved, function(x) { x["x"] <- NULL; x }) # removing the useless x column
dfSum <- lapply(dfFreqxRemoved, function(x) sum(x))
require(dplyr)
#Now converting into a dataframe
dfBound <- rbind(dfSum)
dfData <- as.data.frame(dfBound)
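For reference, a more compact route to the same per-sample counts is shown below. This is just a sketch, assuming the goal is to count every non-NA entry in each sample column (which is what summing the freq column after dropping the NA row amounts to):
# count non-NA entries in every column except X1, in one step
freqPerSample <- colSums(!is.na(df[, -1]))
dfData2 <- as.data.frame(t(freqPerSample))
dfData2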

Related

dplyr::mutate changes row numbers, how to keep them?

I am using lme4::lmList on a tibble to obtain the coefficients of linear fit lines fitted for each subject (id) in my data. What I actually want is a nice long chain of pipes because I don't want to keep any of this output, just use it for a slope/intercept plot. However, I am running into a problem. lmList is creating a dataframe where the row numbers are the original subject ID numbers. I want to keep that information, but as soon as I use mutate on the output, the row numbers change to be sequential from 1. I tried rescuing them first by using rowid_to_column but that just gives me a column of sequential numbers from 1 too. What can I do, other than drop out of the pipe and put them in a column with base R? Is unique(a_df$id) really the best solution? I had a look around on here but didn't see a question like this one.
library(tibble)
library(dplyr)
library(Matrix)
library(lme4)
a_df <- tibble(id = c(rep(4, 3), rep(11, 3), rep(12, 3), rep(42, 3)),
               age = c(rep(seq(1, 3), 4)),
               hair = 1 + (age * 2) + rnorm(12) + as.vector(sapply(rnorm(4), function(x) rep(x, 3))))
# as.data.frame to get around stupid RStudio diagnostics bug
int_slope <- coef(lmList(hair ~ age | id, as.data.frame(a_df))) %>%
  setNames(., c("Intercept", "Slope"))
# Notice how the row numbers are the original subject ids?
print(int_slope)
Intercept Slope
4 2.9723596 1.387635
11 0.2824736 2.443538
12 -1.8912636 2.494236
42 0.8648395 1.680082
int_slope2 <- int_slope %>% mutate(ybar = Intercept + (mean(a_df$age) * Slope))
# Look! Mutate has changed them to be the numbers 1 to 4
print(int_slope2)
Intercept Slope ybar
1 2.9723596 1.387635 5.747630
2 0.2824736 2.443538 5.169550
3 -1.8912636 2.494236 3.097207
4 0.8648395 1.680082 4.225004
# Try to rescue them with rowid_to_column
int_slope3 <- int_slope %>% rowid_to_column(var = "id")
# Nope, 1 to 4 again
print(int_slope3)
id Intercept Slope
1 1 2.9723596 1.387635
2 2 0.2824736 2.443538
3 3 -1.8912636 2.494236
4 4 0.8648395 1.680082
Thanks,
SJ
The dplyr/tidyverse universe doesn't "believe in" row names. Any data that is important for an observation should be included in a column. The tibble package includes a function to move row names into a column. Try
int_slope %>% rownames_to_column()
before any mutates.
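A minimal sketch of that fix, run before the mutate (the column name "id" here is just illustrative):
library(tibble)
library(dplyr)
int_slope2 <- int_slope %>%
  rownames_to_column(var = "id") %>%   # row names become a regular column
  mutate(ybar = Intercept + mean(a_df$age) * Slope)
int_slope2   # the original subject ids survive in the id column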
Nothing like asking for help to make you see the answer. Those aren't row numbers, they're numeric row names. Of course they are! Non-contiguous row numbers make no sense. rownames_to_column is my answer.
Why don't you just create another 'ybar' column on int_slope?
int_slope$ybar <- int_slope$Intercept + mean(a_df$age) * int_slope$Slope

how to check values in one column are all identical by a second grouping variable?

I am using r to analyse some data that is in long format. I have one column that is a grouping variable which contains participant IDs and another variable that contains their sex.
e.g.
ID SEX
1 M
1 M
2 F
2 F
2 M
I would like to check whether there are any IDs which do not have sex coded consistently e.g. ID=2 above. Is there a way to do this? I have been playing around with dplyr and the group_by function, but I am at a loss. Any help would be greatly appreciated.
In terms of output, I would probably like a vector of all unique ID values that have non-identical values in the SEX column.
Here's a base R solution using ave():
df[ave(df$SEX, df$ID, FUN = function(x) length(unique(x))) > 1, ]
ID SEX
3 2 F
4 2 F
5 2 M
You can try this.
require(plyr)
df <- data.frame(c(1,1,2,2,2), c('M','M','F','F','M'))
names(df) <- c('ID','SEX')
df2 <- ddply(df,.(ID), mutate, count = length(unique(SEX)))
unique(df2[df2$count > 1,][1])
Result:
ID
2
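Since the question mentions dplyr and group_by, here is a sketch along those lines (assuming the df built in the answer above):
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(SEX) > 1) %>%   # keep only groups whose SEX is not constant
  distinct(ID) %>%
  pull(ID)
# [1] 2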

Add column to dataframe which is the sd of rnorm from previous columns

I have a data frame of two columns
set.seed(120)
df <- data.frame(m1 = runif(500,1,30),n1 = round(runif(500,10,25),0))
and I wish to add a third column that uses columns n1 and m1 to generate a normal distribution and then takes the standard deviation of that distribution. I mean to use the values in each row: column n1 as the number of replicates (n) and column m1 as the mean.
How can I write a function to do this? I have tried to use apply
stdev <- function(x,y) sd(rnorm(n1,m1))
df$Sim <- apply(df,1,stdev)
But this does not work. Any pointers would be much appreciated.
Many thanks,
Matt
Your data frame input looks like:
# > head(df)
# m1 n1
# 1 12.365323 15
# 2 4.654487 15
# 3 10.993779 24
# 4 24.069388 22
# 5 6.684450 18
# 6 15.056766 16
I mean to use the values in each row of the columns n1 and m1 as the number of replicates (n) and as the mean.
First show you how to use apply:
apply(df, 1, function(x) sd(rnorm(n = x[2], mean = x[1])))
But a better way is to use mapply:
mapply(function(x,y) sd(rnorm(n = x, mean = y)), df$n1, df$m1)
apply is ideal for matrix input; with a data frame you pay a significant overhead for type conversion, because the data frame is coerced to a matrix first.
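To attach the result as the Sim column the question asks for, a usage sketch built on the mapply call above:
# simulate per row, take the sd, and store it in a new column
df$Sim <- mapply(function(n, m) sd(rnorm(n = n, mean = m)), df$n1, df$m1)
head(df)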
Another option
sapply(Map(rnorm, n = df$n1, mean = df$m1), sd)

Data handling: 2 independent factors, which decide the position of a numeric value in a new data frame

I am new to Stackoverflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R-script, which allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I run into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
1 A 1
1 B 2
2 A 3
3 A 2
3 B 3
3 C 1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
A B C
1 1 2 NA
2 3 NA NA
3 2 3 1
My approach
First I took out the names of samples and detectors and saved them in vectors as factors.
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
Then I subsetted the detectors into a new dataframe based on the name of the detector in the dataframe.
for (i in 1:length(detectors)){
  assign(detectors[i], DF[which(DF$Detector == detectors[i]), ])
}
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the problem: I have to get the values from the detector subsets into the result dataframe. It is important that each value finds its way to the right position in the dataframe. The issue is that the subsets do not all contain the same number of values, since some samples lack some detectors.
I tried to do the following: iterate through the detector subsets, compare the row name (= sample name) with each other, and if it is the same write the value into the new dataframe. In case it is not the same, it should write an NA.
for (i in 1:length(detectors)){
  for (j in 1:length(get(detectors[i])$Sample)){
    result[j, i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j, ]), get(detectors[i])$Ct.Mean[j], NA)
  }
}
The trouble is that this stops iterating through the detector's Sample column and switches to the next detector. My understanding is that the compared samples get out of sync, so all following ifelse comparisons yield NA.
I tried to circumvent this by replacing the no argument of ifelse(test, yes, no) with j = j + 1 to get it back in sync, but that unfortunately didn't work.
I hope I could make my problem understandable to you!
Looking forward to hearing any suggestions or comments (also on how to generally improve my code ;)
We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
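For example, the same formula with dcast returns a data.frame, with Sample as a regular column:
dcast(DF, Sample ~ Detector, value.var = 'Value')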
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column.
library(tidyr)
spread(DF, Detector, Value)
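In more recent versions of tidyr (1.0.0 and later), spread has been superseded by pivot_wider. A sketch of the same reshape, assuming DF is built as in the question:
library(tidyr)
DF <- data.frame(Sample = c(1, 1, 2, 3, 3, 3),
                 Detector = c('A', 'B', 'A', 'A', 'B', 'C'),
                 Value = c(1, 2, 3, 2, 3, 1))
pivot_wider(DF, names_from = Detector, values_from = Value)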

R How to count occurrences of values across multiple columns of a data frame and save the columnwise counts from a particular value as a new row?

I have a large data-frame (approx 1,000 rows and 30,000 columns) that looks like this:
chr pos sample1 sample2 sample3 sample 4
1 5050 1 NA 0 0.5
1 6300 1 0 0.5 1
1 7825 1 0 0.5 1
1 8200 0.5 0.5 0 1
where at a given "chr"&"pos" the value for a given sample can take the form of 0, 0.5, 1, or NA. I have a large number of queries to perform that will require subsetting and ordering the data frame based on summaries of the values for each sample.
I would like to get a count of the number of occurrences of a given value (e.g. 0.5) for each column, and save that as a new row in my data frame. My ultimate goal is to be able to use the values of the new row to subset and/or order the columns of my data frame. I've seen similar questions about counting occurrences, but I can't seem to find/recognize a solution to doing this across all columns simultaneously and saving the column-wise counts for a particular value as a new row.
You can apply a function to all the columns of your data.frame. Suppose you want to count the number of 'A' values in each column of the data.frame d:
#a sample data.frame
L3 <- LETTERS[1:3]
(d <- data.frame(cbind(x = 1, y = 1:10), fac = sample(L3, 10, replace = TRUE)))
# the function you are looking for
apply(X = d, 2, FUN = function(x) length(which(x == 'A')))
Very similar to #Jilber. Assumes your data is in a data frame df.
lst <- colnames(df[,-(1:2)])
count.na <- sapply(lst,FUN=function(x,df){sum(is.na(df[,x]))},df)
count.00 <- sapply(lst,FUN=function(x,df){sum(df[,x]==0,na.rm=T)},df)
count.05 <- sapply(lst,FUN=function(x,df){sum(df[,x]==0.5,na.rm=T)},df)
count.10 <- sapply(lst,FUN=function(x,df){sum(df[,x]==1.0,na.rm=T)},df)
df <- rbind(df,
c(NA,NA,count.na),
c(NA,NA,count.00),
c(NA,NA,count.05),
c(NA,NA,count.10))
You would probably want to replace the NA's in the last rbind(...) statement with something that identifies what you are counting.
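A more direct alternative that avoids looping over column names is sketched below, assuming the sample columns start at column 3 as in the example:
# column-wise counts of one value across all sample columns at once
count.05 <- colSums(df[, -(1:2)] == 0.5, na.rm = TRUE)
count.na <- colSums(is.na(df[, -(1:2)]))
# append as a new row, padding the chr and pos columns with NA
df <- rbind(df, c(NA, NA, count.05))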
