create dataframe in for loop using dataframe array - r

I have a dataframe like the one below. I need to extract sub-dataframes based on the regions that are available in RL.
>avg_data
region SN value
beta 1 32
alpha 2 44
beta 3 55
beta 4 60
atp 5 22
> RL
V1
1 beta
2 alpha
The result should be an array of dataframes, something like REGR[beta], which should contain the beta-related rows as below:
region SN value
beta 1 32
beta 3 55
beta 4 60
Similarly for REGR[alpha]
region SN value
alpha 2 44
So that I can pass REGR as an argument for plotting graphs.
REGR <- data.frame()
for (i in levels(RL$V1)) {
  REGR[i, ] <- avg_data[avg_data$region == i, ]
}
I made some mistake in the above code. Please correct me, thank you.

The split function may be of interest to you. From the help page, split divides the data in the vector x into the groups defined by f.
So for your data, it may look something like:
> split(avg_data, avg_data$region)
$alpha
region SN value
2 alpha 2 44
$atp
region SN value
5 atp 5 22
$beta
region SN value
1 beta 1 32
3 beta 3 55
4 beta 4 60
If you want to filter out the records that do not occur in RL, I'd probably do that in a preprocessing step using the %in% function and [ for extraction:
x <- avg_data[avg_data$region %in% RL$V1,]
#-----
region SN value
1 beta 1 32
2 alpha 2 44
3 beta 3 55
That's what I'd feed to split if you want to drop atp.
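Putting the two steps together gives you the named list you're after (a minimal sketch using the filtered x from above; drop = TRUE omits empty groups such as atp if region is a factor):
REGR <- split(x, x$region, drop = TRUE)
REGR$beta    # the beta rows: region SN value
REGR$alpha   # the alpha rows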
The approach above may be overkill if you just want to plot. Here's an example using sapply to iterate through each level of region and make a plot:
sapply(unique(x$region), function(z)
  plot(x[x$region == z, "value"], main = z[1]))

Related

Integrate functions for depth integrated species abundance

Hi,
I am trying to calculate the organism quantity per class over the entire depth range (e.g., from 10 m to 90 m). To do that I have the abundance at certain depths (e.g., 10, 30 and 90 m) and I use an integrate function which calculates:
the average abundance between each pair of depths, multiplied by the difference between that pair of depths. The values are summed up over the entire water column to get a total abundance over the water column.
See an example (only a tiny part of a bigger data set with several locations, years, classes and depths):
View(df)
Class Depth organismQuantity
1 Ciliates 10 1608.89
2 Ciliates 30 2125.09
3 Ciliates 90 1184.92
4 Dinophyceae 10 0.00
5 Dinoflagellates 30 28719.60
6 Dinoflagellates 90 4445.26
integrate = function(x) {
  averages = (x$organismQuantity[1:length(x)-1] + x$organismQuantity[2:length(x)]) / 2
  sum(averages * diff(x$Depth))
}
library(plyr)
result = ddply(df, .(Class), integrate)
print(result)
But I got this result, with NA values and a warning message for some classes:
Class V1
1 Ciliates 136640.1
2 Dinoflagellates NA
3 Dinophyceae 0.0
Warning messages:
1: In averages * diff(x$Depth) :
longer object length is not a multiple of shorter object length
I don't understand why Dinoflagellates got an NA value... It is the same for several other classes in my complete data set (for some classes the integration works; for others I get the warning message).
thank you for the help!!
Cheers,
Lucie
Here is a way using the function trapz from package caTools, adapted to the problem. (The NAs and the warning in your version come from length(x): for a data frame that is the number of columns, not the number of rows, and note that 1:length(x)-1 parses as (1:length(x)) - 1.)
#
# library(caTools)
# Author(s)
# Jarek Tuszynski
#
# Original, adapted
trapz <- function(DF, x, y){
  x <- DF[[x]]
  y <- DF[[y]]
  idx <- seq_along(x)[-1]
  as.double((x[idx] - x[idx-1]) %*% (y[idx] + y[idx-1])) / 2
}
library(plyr)
ddply(df, .(Class), trapz, x = "Depth", y = "organismQuantity")
# Class V1
#1 Ciliates 136640.1
#2 Dinoflagellates 994945.8
#3 Dinophyceae NA
Data
df <- read.table(text = "
Class Depth organismQuantity
1 Ciliates 10 1608.89
2 Ciliates 30 2125.09
3 Ciliates 90 1184.92
4 Dinophyceae 10 0.00
5 Dinoflagellates 30 28719.60
6 Dinoflagellates 90 4445.26
", header = TRUE)

Multiply various subsets of a data frame by different elements of a vector R

I have a data frame:
df<-data.frame(id=rep(1:10,each=10),
Room1=rnorm(100,0.4,0.5),
Room2=rnorm(100,0.3,0.5),
Room3=rnorm(100,0.7,0.5))
And a vector:
vals <- sample(7:100, 10)
I want to multiply cols Room1, Room2 and Room3 by a different element of the vector for every unique ID number and output a new data frame (df2).
I managed to multiply each column per id by EVERY element of the vector using the following:
samp_func <- function(x) {
x*vals[i]
}
for (i in vals) {
df2 <- df %>% mutate_at(c("Room1", "Room2", "Room3"), samp_func)
}
But in the resulting df (df2), each Room column is multiplied by the same element of the vector (vals) for all of the different ids, when what I want is each Room column (per id) multiplied by a different element of vals. Sorry in advance if this is not clear; I am a beginner and still getting to grips with the terminology.
Thanks!
EDIT: The desired output should look like the below, where the columns for each ID have been multiplied by a different element of the vector vals.
id Room1 Room2 Room3
1 1 24.674826880 60.1942571 46.81276141
2 1 21.970270107 46.0461779 35.09928150
3 1 26.282357614 -3.5098880 38.68400541
4 1 29.614182061 -39.3025587 25.09146592
5 1 33.030886472 46.0354881 42.68209027
6 1 41.362699668 -23.6624632 26.93845129
7 1 5.429031042 26.7657577 37.49086963
8 1 18.733422977 -42.0620572 23.48992138
9 1 -17.144070723 9.9627315 55.43999326
10 1 45.392182468 20.3959968 -16.52166621
11 2 30.687978299 -11.7194020 27.67351631
12 2 -4.559185345 94.9256561 9.26738357
13 2 86.165076849 -1.2821515 29.36949423
14 2 -12.546711562 47.1763755 152.67588456
15 2 18.285856423 60.5679496 113.85971720
16 2 72.074929648 47.6509398 139.69051486
17 2 -12.332519694 67.8890324 20.73189965
18 2 80.889634991 69.5703581 98.84404415
19 2 87.991093995 -20.7918559 106.13610773
20 2 -2.685594148 71.0611693 47.40278949
21 3 4.764445589 -7.6155681 12.56546664
22 3 -1.293867841 -1.1092243 13.30775785
23 3 16.114831628 -5.4750642 8.58762550
24 3 -0.309470950 7.0656088 10.07624289
25 3 11.225609780 4.2121241 16.59168866
26 3 -3.762529113 6.4369973 15.82362705
27 3 -5.103277731 0.9215625 18.20823042
28 3 -10.623165177 -5.2896293 33.13656839
29 3 -0.002517872 5.0861361 -0.01966699
30 3 -2.183752881 24.4644310 13.55572730
This should solve your problem. You can build a lookup dataset of id/val pairs to make sure you calculate each combination, merge it onto the Room values, and then use mutate to make the new Room columns.
Also, in the future I'd recommend setting a seed when asking questions with random data as it's easier for someone to replicate your output.
library(dplyr)
set.seed(0)
df<-data.frame(id=rep(1:10,each=10),
Room1=rnorm(100,0.4,0.5),
Room2=rnorm(100,0.3,0.5),
Room3=rnorm(100,0.7,0.5))
vals <- sample(7:100, 10)
other_df <- data.frame(id = 1:10,
                       val = vals)
df2 <- inner_join(other_df, df, by = "id")
df2 <- df2 %>%
  mutate(Room1 = Room1 * val,
         Room2 = Room2 * val,
         Room3 = Room3 * val)
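An alternative sketch without the join: because id runs 1 to 10 in this data, it can index vals directly, giving one multiplier per row (this shortcut only holds for consecutive integer ids):
df2 <- df
# vals[df$id] lines up the right multiplier with each row's id
df2[c("Room1", "Room2", "Room3")] <- df[c("Room1", "Room2", "Room3")] * vals[df$id]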

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID; instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
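If you want the 0/1 columns from the question rather than the labelled table, here is a sketch building on the same rolled join (times that roll back to a "start" boundary lie inside an interval):
res <- melt(truth, measure.vars = c("start", "stop"), value.name = "times")[
  possible, on = "times", roll = TRUE]
possible[, between := as.integer(res$variable == "start")]
truth[, match := as.integer(id %in% res$id[res$variable == "start"])]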
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
                    c(sample(5:20, size = 100, replace = T)),
                    c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
  #get boolean vector to show if any of the 'truth' rows are a 'match'
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  #if any are true then update the match and between vectors
  if(any(match.vec)){
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
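A vectorized sketch of the same matching without the explicit loop, using the column names from the question's data (truth$Start, truth$Stop, possible$Times); widening both comparisons by 2 gives the "within 2 seconds" variant:
# hit[i, j] is TRUE when possible$Times[i] falls in truth row j's interval
hit <- outer(possible$Times, truth$Start, ">=") &
       outer(possible$Times, truth$Stop, "<=")
possible$Between <- as.integer(rowSums(hit) > 0)
truth$Match <- as.integer(colSums(hit) > 0)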

r - lapply divides a column by an integer value from different dataset, unexpected result

I have two data.frames: one with genotype counts and one with the numbers I need to normalize the counts in the first dataset.
countsdata=data.frame(genotype1=rep(c(10,20,30,40),each=1),
genotype2=rep(c(100,200,300,400),each=1),
genotype3=rep(c(40,50,60,70),each=1),
genotype4=rep(c(40,50,60,70),each=1)
)
coldata = data.frame(Group =c('genotype1', 'genotype2', 'genotype3', 'genotype4'),
Treatment = rep(c("control","treated"),each = 2),
Norm=rep(c(1,2,5,5)))
I made sure my variables are not factors:
factorsCharacter <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
as.character))
coldata=factorsCharacter(coldata)
Then I checked that lapply loops through my counts one column at a time, and through my coldata that contains the normalization value (Norm). All looks good until I combine the two actions in the same step:
> lapply(coldata['Group'],function(group_i){group_i})
$Group
[1] "genotype1" "genotype2" "genotype3" "genotype4"
> lapply(coldata['Group'],function(group_i){countsdata[,group_i]})
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 20 200 50 50
3 30 300 60 60
4 40 400 70 70
> lapply(coldata['Group'],function(group_i){as.integer(coldata[coldata$Group==group_i,'Norm'])})
$Group
[1] 1 2 5 5
> lapply(coldata['Group'],function(group_i){
+ countsdata[,group_i]/as.integer(coldata[coldata$Group==group_i,'Norm'])
+ })
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 10 100 25 25
3 6 60 12 12
4 8 80 14 14
Here the result is not what I was expecting (each column divided by its normalization number). After further inspection I noticed it's normalizing by rows, in other words across different columns, which shouldn't be the case as I am looping through one column at a time. I am probably missing a basic concept, but looking through other SO posts I didn't find anything I could use. My goal is to fix the code to make the right calculation, but I also would like to understand why the code above is not working. Thanks so much.
The problem is in using [ and not [[. coldata['Group'] is a one-column data frame, i.e. a list of length 1 holding all the elements, so instead of looping through each element of the 'Group' column, lapply loops once over the whole vector, and the division inside then recycles the full Norm vector down the rows. Use coldata[, 'Group'], coldata[['Group']], or coldata$Group for looping.
countsdataNew <- countsdata
countsdataNew[] <- lapply(coldata[['Group']], function(group_i)
  countsdata[, group_i] / coldata$Norm[coldata$Group == group_i])
countsdataNew
# genotype1 genotype2 genotype3 genotype4
#1 10 50 8 8
#2 20 100 10 10
#3 30 150 12 12
#4 40 200 14 14
If the column names in 'countsdata' and the 'Group' column from 'coldata' are in the same order, we can do this easily with Map:
Map(`/`, countsdata, coldata$Norm)
Or just replicate the 'Norm' and do a simple division
countsdata/coldata$Norm[col(countsdata)]
Or with sweep
sweep(countsdata, 2, coldata$Norm, "/")
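As a quick sanity check (a sketch on the data from the question), the alternatives agree:
a <- as.data.frame(Map(`/`, countsdata, coldata$Norm))
b <- sweep(countsdata, 2, coldata$Norm, "/")
all.equal(a, b)
# [1] TRUE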

getting from histogram counts to cdf

I have a dataframe where I have values, and for each value I have the counts associated with that value. So, plotting counts against values gives me the histogram. I have three types, a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts up to that value:
get_cumulative <- function(x) {
  result <- numeric(nrow(x))
  for (i in seq_along(result)) {
    result[i] <- sum(x[x$value <= x$value[i], ]$counts)
  }
  x$cumulative <- result
  x
}
Then I wrap this in a ddply that splits by the type. This is obviously not the best way, and I'd love any suggestions on how to proceed.
You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
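Applied to the pre-binned data in the question (a sketch assuming the value/counts/type columns shown, with rows sorted by value within each type):
df$cdf <- ave(df$counts, df$type, FUN = function(x) cumsum(x) / sum(x))
# one step plot per type; type "a" shown here
with(subset(df, type == "a"),
     plot(value, cdf, type = "s", xlab = "value", ylab = "CDF", main = "a"))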
If your data is in data.frame DF, then the following should do it, applying cumsum per type to the counts column (a plain cumsum over DF would fail on the non-numeric type column):
do.call(rbind, lapply(split(DF, DF$type),
                      function(d) within(d, cdf <- cumsum(counts) / sum(counts))))
The HistogramTools package on CRAN has several functions for converting between histograms and CDFs and for calculating information loss or error margins, along with plotting functions to help with this.
If you have a histogram h then calculating the Empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R Histogram object, then see the PreBinnedHistogram function first.
