How to add a function inside sum() in R language

I have a dataframe:
SampleName <- c("A","A","A","A","B")
NumberofSample <- c(1,2,3,1,4)
SampleResult <- c(3,6,12,12,14)
Data <- data.frame(SampleName,NumberofSample,SampleResult)
head(Data)
  SampleName NumberofSample SampleResult
1          A              1            3
2          A              2            6
3          A              3           12
4          A              1           12
5          B              4           14
My idea is: when SampleResult < 15 && SampleResult > 5, sample A has 6 sample sites that match the condition (the matching NumberofSample values 2 + 3 + 1) and sample B has 4. So the ideal result would look like this:
SampleName Frequency
1 A 6
2 B 4
I wrote something like:
D1 <- aggregate(SampleResult ~ SampleName, Data, function(x) sum(x < 15 && x > 5))
But I feel this lacks a weighting term, something like
x * Data$NumberofSample[x]
So my question is: what's the right way to code this? Thank you.

We can use dplyr: group by 'SampleName', subset the 'NumberofSample' values that meet the condition on 'SampleResult', and take the sum.
library(dplyr)
Data %>%
  group_by(SampleName) %>%
  summarise(Frequency = sum(NumberofSample[SampleResult < 15 &
                                             SampleResult > 5]))
# A tibble: 2 x 2
# SampleName Frequency
# <chr> <int>
#1 A 6
#2 B 4
If we prefer aggregate:
aggregate(cbind(Frequency = NumberofSample * (SampleResult < 15 &
                                                SampleResult > 5)) ~ SampleName, Data, sum)
# SampleName Frequency
#1 A 6
#2 B 4
Note that && is not vectorised: it returns a single TRUE/FALSE value (and recent R versions throw an error when its arguments have length greater than one), so
(1:3 > 1) && (2:4 > 2)
does not give a logical vector of the same length; use & for elementwise comparison.
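A quick sketch of the elementwise behaviour of &, using the question's SampleResult values (base R only):

```r
x <- c(3, 6, 12, 12, 14)   # the question's SampleResult values
x < 15 & x > 5             # elementwise: FALSE TRUE TRUE TRUE TRUE
sum(x < 15 & x > 5)        # TRUE counts as 1, so this counts matches: 4
```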

akrun’s solution is spot-on. But it so happens that {dplyr} offers a convenience function for this kind of computation: count.
In its most common form it counts the number of rows in each group. However, it can also perform a weighted sum, and in your case we simply weight by whether the SampleResult is between your chosen bounds:
Data %>%
  count(
    SampleName,
    wt = NumberofSample[SampleResult > 5 & SampleResult < 15]
  )
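As a base-R cross-check of the same weighted tally (a sketch, not part of the count() answer), xtabs can sum a weight column over groups after subsetting on the condition:

```r
# Rebuild the question's data with quoted sample names
Data <- data.frame(SampleName = c("A", "A", "A", "A", "B"),
                   NumberofSample = c(1, 2, 3, 1, 4),
                   SampleResult = c(3, 6, 12, 12, 14))
# Keep only rows whose SampleResult lies strictly between 5 and 15
ok <- Data$SampleResult > 5 & Data$SampleResult < 15
# Sum NumberofSample per SampleName on the subset: A = 6, B = 4
xtabs(NumberofSample ~ SampleName, Data[ok, ])
```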

Maybe the following form of aggregate is simpler. I subset Data based on the condition you want and then take the length of each group (note this counts matching rows and does not weight by NumberofSample, so it gives 3 and 1 rather than 6 and 4):
inx <- with(Data, 5 < SampleResult & SampleResult < 15)
aggregate(SampleResult ~ SampleName, Data[inx, ], length)
#SampleName SampleResult
#1 A 3
#2 B 1
Another possibility would be
subData <- subset(Data, 5 < SampleResult & SampleResult < 15)
aggregate(SampleResult ~ SampleName, subData, length)
but I think the logical index solution is better since its memory usage is smaller.

Related

Looping over multiple columns to generate a new variable based on a condition

I am trying to generate a new column (variable) based on the values inside multiple columns.
I have over 60 columns in the dataset and I wanted to subset the columns that I want to loop through.
The column variables I am using in my condition are all characters, and when a certain pattern is matched, I want to return a value of 1 in the new variable.
I am using case_when because I need to run multiple conditions on each column to return a value.
CODE:
df <- read.csv("sample.csv")
# Generate new variable name
df$new_var <- 0
# For loop through columns 16 to 45
for (i in colnames(df[16:45])) {
  df <- df %>%
    mutate(new_var =
      case_when(
        grepl("I8501", df[[i]]) ~ 1
      ))
}
This does not work: when I table the results, I only get one value matched.
My other attempt was using:
for (i in colnames(df[16:45])) {
  df <- df %>%
    mutate(new_var =
      case_when(
        df[[i]] == "I8501" ~ 1
      ))
}
Are there any other possible ways in R to run through multiple columns with multiple conditions and change the value of the variable accordingly?
If I'm understanding what you want, I think you just need to specify another case in your case_when() for keeping the existing values when things don't match "I8501". This is how I would do that:
df$new_var <- 0
for (index in 16:45) {
  df <- df %>%
    mutate(
      new_var = case_when(
        grepl("I8501", df[[index]]) ~ 1,
        TRUE ~ df$new_var
      )
    )
}
I think a better way to do this, though, would be to use the ever-useful apply():
has_match = apply(df[, 16:45], 1, function(x) sum(grepl("I8501", x)) > 0)
df$new_var = ifelse(has_match, 1, 0)
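A minimal, self-contained sketch of that apply() pattern on a hypothetical two-column frame (the column names a and b are made up; the real data would use columns 16:45):

```r
# Toy frame standing in for the character columns of the real data
toy <- data.frame(a = c("I8501", "x", "y"),
                  b = c("z", "xI8501", "q"),
                  stringsAsFactors = FALSE)
# Row-wise: TRUE if any column in the row contains the pattern
has_match <- apply(toy, 1, function(x) sum(grepl("I8501", x)) > 0)
toy$new_var <- ifelse(has_match, 1, 0)
toy$new_var  # 1 1 0 (rows 1 and 2 contain "I8501", row 3 does not)
```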
Kindly check if this works for your file.
Sample df:
df <- data.frame(C1=c('A','B','C','D'),C2=c(1,7,3,4),C3=c(5,6,7,8))
> df
C1 C2 C3
1 A 1 5
2 B 7 6
3 C 3 7
4 D 4 8
library(dplyr)
library(stringr)
df %>%
  rowwise() %>%
  mutate(new_var = as.numeric(any(str_detect(as.character(c_across(2:last_col())), "7"))))
  # change 2:last_col() to select your column range, e.g. 2:5
Output for finding "7" in any of the columns:
C1 C2 C3 new_var
<chr> <dbl> <dbl> <dbl>
1 A 1 5 0
2 B 7 6 1
3 C 3 7 1
4 D 4 8 0

Selecting range in a data.frame containing factors

If I have a data.frame like this, but much bigger
> df
# df
# 1 G0100
# 2 G0546
# 3 G1573
# 4 G1748
# 5 G2214
# 6 G2473
# 7 G2764
# 8 G3421
# 9 G5748
# 10 G8943
is there a beautiful way to select the range between G1500 and G2500 in a much bigger data set?
We can use parse_number with between:
library(dplyr)
library(readr)
df %>%
  filter(between(parse_number(as.character(df)), 1500, 2500))
A data.table option:
library(data.table)
setDT(df)[, .SD[between(as.numeric(gsub("\\D", "", df)), 1500, 2500)]]
       df
1: G1573
2: G1748
3: G2214
4: G2473
It is not really clear from the question what the general case is but we provide a variety of solutions based on different assumptions.
1) Assuming
- the input shown reproducibly in the Note at the end, and
- that the lower and upper bounds are both 5 characters, as in the question,
then use subset as shown. If all values in the data frame are 5 characters, the first condition could be omitted.
subset(df, nchar(df) == 5 & df >= "G1500" & df <= "G2500")
giving:
df
3 G1573
4 G1748
5 G2214
6 G2473
2) Another possibility which relaxes the second assumption above is the following which gives the same output as above. The second argument of strapply is a function given in formula notation. x is the first argument corresponding to the first capture group and y is the second argument corresponding to the second capture group.
library(gsubfn)
subset(df, strapply(df, "(.)(.*)",
       ~ x == 'G' & as.numeric(y) >= 1500 & as.numeric(y) <= 2500,
       simplify = TRUE))
3) If every entry in the data frame begins with G or if we can ignore the letter then we could just omit it.
num <- as.numeric(sub("G", "", df$df))
subset(df, num >= 1500 & num <= 2500)
4) Another variation to read the first character and the rest into separate columns of a new data frame DF and then use subset:
DF <- read.table(text = sub("(.)", "\\1 ", df$df))
subset(df, DF$V1 == "G" & DF$V2 >= 1500 & DF$V2 <= 2500)
Note
Lines <- "
df
1 G0100
2 G0546
3 G1573
4 G1748
5 G2214
6 G2473
7 G2764
8 G3421
9 G5748
10 G8943"
df <- read.table(text = Lines)

Count events in range on a vector via iteration in R

I have a vector that contains the sample numbers of event markers. They are only listed when an event is found, not at every sample. I would like to obtain the number of events found in each second. The sampling rate is known (15 Hz).
I figured out how to do it with a for loop, but it is working a bit on the slow side. I am struggling to figure out a more efficient way to perform this calculation (with mapply or something like that maybe?). Does anybody have any suggestions?
Here is a sample of what I am doing:
vec <- c(9,20,23,48,50,51)
fs <- 15
start_idx <- seq(from=1,to=46,by=15)
end_idx <- seq(from=15,to=60,by=15)
counter <- vector()
for (i in seq_along(start_idx)) {
  counter[i] <- length(which(vec >= start_idx[i] & vec <= end_idx[i]))
}
The results of counter should be:
> counter
[1] 1 2 0 3
Any help is much appreciated!
For a tidyverse approach, you can map inside mutate:
library(tidyverse)
ranges <- tibble(start_idx, end_idx)
ranges %>%
  mutate(ct = map2_int(start_idx, end_idx, ~ sum(.x <= vec & .y >= vec)))
start_idx end_idx ct
<dbl> <dbl> <int>
1 1 15 1
2 16 30 2
3 31 45 0
4 46 60 3
You can use findInterval/cut to find which range each element of vec falls in, and then use table to count the frequencies.
table(factor(findInterval(vec, start_idx), levels = seq_along(start_idx)))
#1 2 3 4
#1 2 0 3
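To see why this works, here is a sketch of the intermediate findInterval result on the question's data: each event is mapped to the index of the one-second bin it falls into, and the factor levels keep the empty bins in the table.

```r
vec <- c(9, 20, 23, 48, 50, 51)
start_idx <- seq(from = 1, to = 46, by = 15)  # bin starts: 1 16 31 46
# Bin index of each event: 1 2 2 4 4 4
findInterval(vec, start_idx)
# Tabulate per bin, keeping bin 3 (which has no events) as a zero
table(factor(findInterval(vec, start_idx), levels = seq_along(start_idx)))
```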

How to create a subset by using another subset as condition?

I want to create a subset using another subset as a condition. I can't show my actual data, but I can show an example that deals with the core of my problem.
For example, I have 10 subjects with 10 observations each, so an example of my data can be created with this simple data frame:
ID <- rep(1:10, each = 10)
x <- rnorm(100)
y <- rnorm(100)
df <- data.frame(ID,x,y)
Which creates:
ID x y
1 1 0.08146318 0.26682668
2 1 -0.18236757 -1.01868755
3 1 -0.96322876 0.09565239
4 1 -0.64841436 0.09202456
5 1 -1.15244873 -0.38668929
6 1 0.28748521 -0.80816416
7 1 -0.64243912 0.69403155
8 1 0.84882350 -1.48618271
9 1 -1.56619331 -1.30379070
10 1 -0.29069417 1.47436411
11 2 -0.77974847 1.25704185
12 2 -1.54139896 1.25146126
13 2 -0.76082748 0.22607239
14 2 -0.07839719 1.94448322
15 2 -1.53020374 -2.08779769
etc.
Some of these subjects were positive for an event (for example subjects 3, 5 and 7), so I have created a subset for that using:
event_pos <- subset(df, ID %in% c("3","5","7"))
Now, I also want to create a subset for the subjects who were negative for an event. I could use something like this:
event_neg <- subset(df, ID %in% c("1","2","4","6","8","9","10"))
The problem is, my data set is too large to specify all the individuals of the negative group. Is there a way to use my subset event_pos to get all the subjects with negative events in one subset?
TL;DR
Can I get a subset_2 by removing the subset_1 from the data frame?
You can use:
ind_list <- c("3","5","7")
event_neg <- subset(df, (ID %in% ind_list) == FALSE)
or
event_neg <- subset(df, !(ID %in% ind_list))
Hope that helps.
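A quick sanity check on simulated data like the question's (a sketch; note that %in% coerces, so a numeric ind_list works the same as the quoted "3","5","7"). The negated subset and the positive subset partition the rows exactly:

```r
set.seed(1)  # reproducible toy data in the question's shape
df <- data.frame(ID = rep(1:10, each = 10), x = rnorm(100), y = rnorm(100))
ind_list <- c(3, 5, 7)
event_pos <- subset(df, ID %in% ind_list)     # 3 subjects * 10 rows = 30 rows
event_neg <- subset(df, !(ID %in% ind_list))  # the remaining 70 rows
nrow(event_pos)                               # 30
nrow(event_neg)                               # 70
any(event_neg$ID %in% ind_list)               # FALSE: no positive subject leaks in
```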

How to find the largest range from a series of numbers using R?

I have a data set where length and age correspond to individual items (ID #); there are 4 different items, as you can see in the data set below.
range(dataset$length)
gives me the overall range of length for all items, but I need to compare ranges to determine which item (ID #) has the largest range in length relative to the other three.
length age ID
3.5    5   1
7      10  1
10     15  1
4      5   2
8      10  2
13     15  2
3      5   3
7      10  3
9      15  3
4      5   4
5      10  4
7      15  4
This gives you the differences in ranges:
lapply( with(dat, tapply(length, ID, range)), diff)
And you can wrap which.max around that list to get the ID associated with the largest value:
which.max( lapply( with(dat, tapply(length, ID, range)), diff) )
2
2
In base R:
mins <- tapply(df$length, df$ID, min)
maxs <- tapply(df$length, df$ID, max)
unique(df$ID)[which.max(maxs - mins)]
group_by in dplyr may be helpful:
library(dplyr)
dataset %>%
  group_by(ID) %>%
  summarize(ID_range = diff(range(length)))
The above code is equivalent to the following (it's just written without %>%):
library(dplyr)
dataset <- group_by(dataset, ID)
summarize(dataset, ID_range = diff(range(length)))
An easy approach which doesn't use dplyr, though perhaps less elegant, is the which function:
range(dataset$length[which(dataset$ID == 1)])
range(dataset$length[which(dataset$ID == 2)])
range(dataset$length[which(dataset$ID == 3)])
range(dataset$length[which(dataset$ID == 4)])
You could also make a function that gives you the actual range (the difference between the max and the min) and use lapply to show the IDs paired with their ranges:
largest_range <- function(id) {
  rbind(id,
        max(dataset$length[which(dataset$ID == id)]) -
          min(dataset$length[which(dataset$ID == id)]))
}
lapply(X = unique(dataset$ID), FUN = largest_range)
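Pulling the base-R pieces together on the question's table (a sketch; diff(range(x)) is simply max minus min):

```r
# Rebuild the question's data set
dataset <- data.frame(length = c(3.5, 7, 10, 4, 8, 13, 3, 7, 9, 4, 5, 7),
                      age = rep(c(5, 10, 15), times = 4),
                      ID = rep(1:4, each = 3))
# Per-ID range width: 6.5, 9, 6, 3 (named "1".."4")
ranges <- tapply(dataset$length, dataset$ID, function(v) diff(range(v)))
names(which.max(ranges))  # "2": ID 2 spans 4 to 13, the widest length range
```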
