I am trying to run some simulations in R and I am stuck on the loop I need to write. I can get what I need in one iteration, but trying to code the loop is throwing me off. This is what I am doing for one iteration:
Subjects <- c(1,2,3,4,5,6)
Group <- c('A','A','B','B','C','C')
Score <- rnorm(6,mean=5,sd=1)
Example <- data.frame(Subjects,Group,Score)
library(dplyr)
Score_by_Group <- Example %>% group_by(Group) %>% summarise(SumGroup = sum(Score))
Score_by_Group$Top_Group <- ifelse(Score_by_Group[,2] == max(Score_by_Group[,2]),1,0)
Group SumGroup Top_Group
1 A 8.77 0
2 B 6.22 0
3 C 9.38 1
What I need the loop to do is run the above X times and, every time a group has the top score, add 1 to that group's running total. For example, if the loop ran with x = 10, I would need a result like this:
Group Top_Group
1 A 3
2 B 5
3 C 2
If you don't mind forgoing the for loop, we can use replicate to repeat the code, then bind the output together, and then summarize.
library(tidyverse)
run_sim <- function() {
Subjects <- c(1, 2, 3, 4, 5, 6)
Group <- c('A', 'A', 'B', 'B', 'C', 'C')
Score <- rnorm(6, mean = 5, sd = 1)
Example <- data.frame(Subjects, Group, Score)
Score_by_Group <- Example %>%
group_by(Group) %>%
summarise(SumGroup = sum(Score)) %>%
mutate(Top_Group = +(SumGroup == max(SumGroup))) %>%
select(-SumGroup)
}
results <- bind_rows(replicate(10, run_sim(), simplify = FALSE)) %>%
group_by(Group) %>%
summarise(Top_Group = sum(Top_Group))
Output
Group Top_Group
<chr> <int>
1 A 3
2 B 3
3 C 4
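If you want the number of runs to be a parameter (the x in the question), you could wrap the same pipeline in a small helper. A minimal sketch, where the names simulate_top_groups and n_sims are mine:
simulate_top_groups <- function(n_sims) {
  # repeat the single-iteration simulation n_sims times and tally the wins
  bind_rows(replicate(n_sims, run_sim(), simplify = FALSE)) %>%
    group_by(Group) %>%
    summarise(Top_Group = sum(Top_Group))
}
simulate_top_groups(10)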
I think this should work:
library(dplyr)
Subjects <- c(1,2,3,4,5,6)
Group <- c('A','A','B','B','C','C')
Groups <- c('A','B','C')
Top_Group <- c(0,0,0)
x <- 10
for(i in 1:x) {
  Score <- rnorm(6, mean = 5, sd = 1)
  Example <- data.frame(Subjects, Group, Score)
  Score_by_Group <- Example %>% group_by(Group) %>% summarise(SumGroup = sum(Score))
  # use the SumGroup column directly so Top_Group stays a plain numeric vector
  Score_by_Group$Top_Group <- ifelse(Score_by_Group$SumGroup == max(Score_by_Group$SumGroup), 1, 0)
  Top_Group <- Top_Group + Score_by_Group$Top_Group
}
tibble(Groups, Top_Group)
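Because Score is drawn with rnorm() inside the loop, the totals will differ on every run; if you want a reproducible tally, set a seed before the loop (the seed value below is arbitrary):
set.seed(123)  # arbitrary seed, only for reproducibility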
Below are my two data frames:
ABData1 <- data.frame(id=c(11,12,13,14,15),
a = c(1,2,3,4,5))
ABData2 <- data.frame(id=c(11,12,13,14),
b = c(1,4,3,4))
How can I compare these two data frames for matching rows and mismatched rows?
If the 1st row of column a in ABData1 matches the 1st row of column b in ABData2, show it as a match, otherwise as a mismatch, then go to the 2nd row, and so on; all the comparisons are row-wise.
I have tried the code below, which works fine for one data frame, but it throws an error because the two data frames have different numbers of rows.
ABData <- data.frame(a = c(1,2,2,1,1),
b = c(1,2,1,1,2))
match<- ABData %>% rowwise() %>% filter(grepl(a,b, fixed = TRUE))
mismatch<- ABData %>% rowwise() %>% filter(!grepl(a,b))
I am expecting the output below.
Expected match Output:
id a expected b
11 1 1 1
13 3 3 3
14 4 4 4
Expected mismatch output:
id a expected b
12 2 2 4
15 NA NA 5
Thanks in advance.
You can use this:
ABData1 <- data.frame(a = c(1,2,3,4,5))
ABData2 <- data.frame(b = c(1,4,3,4))
equLength <- function(x, y) {
  # pad the shorter vector with NA so both have the same length
  if (length(x) > length(y)) length(y) <- length(x) else length(x) <- length(y)
  data.frame(a = x, b = y)
}
ABData <- equLength(ABData1$a, ABData2$b)
... and then use your working code for one dataframe.
library("dplyr")
resultMatch <- ABData %>% rowwise() %>% filter(grepl(a,b, fixed = TRUE))
resultMismatch <- ABData %>% rowwise() %>% filter(!grepl(a,b))
For the extended question:
library("dplyr")
ABData1 <- data.frame(id=c(11,12,13,14,15), a = c(1,2,3,4,5))
ABData2 <- data.frame(id=c(11,12,13,14), b = c(1,4,3,4))
equLength <- function(x, y) {
if (length(x)>length(y)) length(y) <- length(x) else length(x) <- length(y)
data.frame(a=x, b=y)
}
if (nrow(ABData1)>nrow(ABData2)) ABData <- data.frame(ABData1, b=equLength(ABData1$a, ABData2$b)$b) else
ABData <- data.frame(ABData2, a=equLength(ABData1$a, ABData2$b)$a)
resultMatch <- ABData %>% rowwise() %>% filter(grepl(a,b, fixed = TRUE))
resultMismatch <- ABData %>% rowwise() %>% filter(!grepl(a,b))
Having a data frame like this:
data.frame(id = c(1,2), num = c("30, 4, -2,","10, 20"))
How is it possible to take the sum of every row from the column num, and include the minus sign in the calculation?
Example of expected output:
data.frame(id = c(1,2), sum = c(32, 30))
Using Base R you could do the following:
# data
df <- data.frame(id = c(1,2), num = c("30, 4, -2,","10, 20"))
# split by ",", convert to numeric and then sum
df[, 2] <- sapply(strsplit(as.character(df$num), ","), function(x){
sum(as.numeric(x))
})
# result
df
# id num
# 1 1 32
# 2 2 30
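If you prefer to keep the original strings and add a separate sum column (matching the expected output), a small variation on the same idea, re-creating df since the code above overwrote num:
df <- data.frame(id = c(1,2), num = c("30, 4, -2,","10, 20"))
df$sum <- sapply(strsplit(as.character(df$num), ","),
                 function(x) sum(as.numeric(x), na.rm = TRUE))
df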
If you can use packages, the tidy packages make this easy and use tidy data principles, which are quick and easy once you get used to thinking this way.
library(tidyr)
library(dplyr)
df %>%
# Convert the string of numbers to a tidy dataframe
# with one number per row with the id column for grouping
separate_rows(num,sep = ",") %>%
# Convert the text to a number so we can sum
mutate(num = as.numeric(num)) %>%
# Perform the calculation for each id
group_by(id) %>%
# Sum the number
summarise(sum = sum(num,na.rm = TRUE)) %>%
# Ungroup for further use of the data
ungroup()
# A tibble: 2 x 2
# id sum
# <dbl> <dbl>
# 1 1 32
# 2 2 30
library(stringr)
df <- data.frame(id = c(1,2), num = c("30, 4, -2","10, 20"))
df$sum <- NA
for (i in 1:nrow(df)) {
  temp <- as.character(df[i, 2])
  # count how many numbers appear in the string
  n_num <- str_count(temp, '[0-9.]+')
  total <- 0
  for (j in 1:n_num) {
    # take the j-th comma-separated piece and add it to the running total
    digit <- strsplit(temp, ',')[[1]][j]
    total <- total + as.numeric(digit)
    # blank out the piece that was just counted
    temp <- sub(digit, '', temp)
  }
  df[i, 'sum'] <- total
}
print(df)
id num sum
1 1 30, 4, -2 32
2 2 10, 20 30
I'm trying to apply a function over the rows of a data frame and return a value based on the value of each element in a column. I'd prefer to pass the whole dataframe instead of naming each variable as the actual code has many variables - this is a simple example.
I've tried purrr map_dbl and rowwise but can't get either to work. Any suggestions please?
#sample df
df <- data.frame(Y=c("A","B","B","A","B"),
X=c(1,5,8,23,31))
#required result
Res <- data.frame(Y=c("A","B","B","A","B"),
X=c(1,5,8,23,31),
NewVal=c(10,500,800,230,3100)
)
#use mutate and map or rowwise etc
Res <- df %>%
mutate(NewVal=map_dbl(.x=.,.f=FnAdd(.)))
Res <- df %>%
rowwise() %>%
mutate(NewVal=FnAdd(.))
#sample fn
FnAdd <- function(Data){
if(Data$Y=="A"){
X=Data$X*10
}
if(Data$Y=="B"){
X=Data$X*100
}
return(X)
}
If there are multiple values, it is better to have a key/val dataset, join and then do the multiplication
keyVal <- data.frame(Y = c("A", "B"), NewVal = c(10, 100))
df %>%
left_join(keyVal) %>%
mutate(NewVal = X*NewVal)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
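By default left_join() reports which column it is joining by; you can make the key explicit (and silence that message) with the by argument:
df %>%
  left_join(keyVal, by = "Y") %>%
  mutate(NewVal = X * NewVal)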
It is not clear how many unique values there are in the 'Y' column of the actual dataset. If we have only a few values, then case_when can be used
FnAdd <- function(Data){
Data %>%
mutate(NewVal = case_when(Y == "A" ~ X * 10,
Y == "B" ~ X *100,
TRUE ~ X))
}
FnAdd(df)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
You were originally looking for a solution using dplyr's rowwise() function, so here is that solution. The nice thing about this approach is that you don't need to create a separate function.
Here's the version using ifelse():
df %>%
  rowwise() %>%
  mutate(NewVal = ifelse(Y == "A", X * 10,
                         ifelse(Y == "B", X * 100, NA)))
and here's the version using case_when:
df %>%
rowwise() %>%
mutate(NewVal = case_when(Y == "A" ~ X * 10,
Y == "B" ~ X * 100))
This question is similar to selecting the top N values within a group by column here.
However, I want to select the last N values by group, with N depending on the value of a corresponding count column. The count represents the number of occurrences of a specific name. If the count is 3 or more, I only want the last three entries, but if it is less than 3, I only want the last entry.
# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"), Value = c(1,2,3,4,5,6,7,8,9))
# Obtain count for each name
count <- df %>%
group_by(Name) %>%
summarise(Count = n_distinct(Value))
# Merge dataframe with count
merge(df, count, by=c("Name"))
# Delete the first entry for x and the first entry for z
# Desired output
data.frame(Name = c("x","x","x","y","y","y","z"), Value = c(2,3,4,5,6,7,9))
Another dplyrish way:
df %>% group_by(Name) %>% slice(tail(row_number(),
if (n_distinct(Value) < 3) 1 else 3
))
# A tibble: 7 x 2
# Groups: Name [3]
Name Value
<fctr> <dbl>
1 x 2
2 x 3
3 x 4
4 y 5
5 y 6
6 y 7
7 z 9
The analogue in data.table is...
library(data.table)
setDT(df)
df[, tail(.SD, if (uniqueN(Value) < 3) 1 else 3), by=Name]
The closest thing in base R is...
with(df, {
len = tapply(Value, Name, FUN = length)
nv = tapply(Value, Name, FUN = function(x) length(unique(x)))
df[ sequence(len) > rep(nv - ifelse(nv < 3, 1, 3), len), ]
})
... which is way more difficult to come up with than it should be.
Another possibility:
library(tidyverse)
df %>%
split(.$Name) %>%
map_df(~ if (n_distinct(.x) >= 3) tail(.x, 3) else tail(.x, 1))
Which gives:
# Name Value
#1 x 2
#2 x 3
#3 x 4
#4 y 5
#5 y 6
#6 y 7
#7 z 9
In base R, split the df by df$Name first. Then, for each subgroup, check number of rows and extract last 3 or last 1 row conditionally.
do.call(rbind, lapply(split(df, df$Name), function(a)
a[tail(sequence(NROW(a)), c(3,1)[(NROW(a) < 3) + 1]),]))
Or
do.call(rbind, lapply(split(df, df$Name), function(a)
a[tail(sequence(NROW(a)), ifelse(NROW(a) < 3, 1, 3)),]))
# Name Value
#x.2 x 2
#x.3 x 3
#x.4 x 4
#y.5 y 5
#y.6 y 6
#y.7 y 7
#z z 9
For three conditional values
do.call(rbind, lapply(split(df, df$Name), function(a)
a[tail(sequence(NROW(a)), ifelse(NROW(a) >= 6, 6, ifelse(NROW(a) >= 3, 3, 1))),]))
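If the nested ifelse() gets unwieldy as more thresholds are added, findInterval() is one way to express the same lookup; a sketch of the three-threshold case above:
do.call(rbind, lapply(split(df, df$Name), function(a)
  a[tail(sequence(NROW(a)), c(1, 3, 6)[findInterval(NROW(a), c(3, 6)) + 1]), ]))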
If you're already using dplyr, the natural approach is:
library(dplyr)
# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"),
Value = c(1,2,3,4,5,6,7,8,9))
df %>%
group_by(Name) %>%
mutate(Count = n_distinct(Value),
Rank = dense_rank(desc(Value))) %>%
filter((Count>= 3 & Rank <= 3) | (Rank==1)) %>%
select(-c(Count,Rank))
There's no need for a merge since you are just counting and ranking on groups defined by Name. Then, you apply a filter on your count and rank requirements, and (optionally, for clean-up) drop the counts and ranks.
Starting with data containing multiple observations for each group, like this:
set.seed(1)
my.df <- data.frame(
timepoint = rep(c(0, 1, 2), each= 3),
counts = round(rnorm(9, 50, 10), 0)
)
> my.df
timepoint counts
1 0 44
2 0 52
3 0 42
4 1 66
5 1 53
6 1 42
7 2 55
8 2 57
9 2 56
To perform a summary calculation at each timepoint relative to timepoint == 0, I need to pass, for each group, a vector of counts for timepoint == 0 and a vector of counts for that group (e.g. timepoint == 1) to an arbitrary function, e.g.
NonsenseFunction <- function(x, y){
(mean(x) - mean(y)) / (1 - mean(y))
}
I can get the required output from this table, either with dplyr:
library(dplyr)
my.df %>%
group_by(timepoint) %>%
mutate(rep = paste0("r", 1:n())) %>%
left_join(x = ., y = filter(., timepoint == 0), by = "rep") %>%
group_by(timepoint.x) %>%
summarise(result = NonsenseFunction(counts.x, counts.y))
or data.table:
library(data.table)
my.dt <- data.table(my.df)
my.dt[, rep := paste0("r", 1:length(counts)), by = timepoint]
merge(my.dt, my.dt[timepoint == 0], by = "rep", all = TRUE)[
, NonsenseFunction(counts.x, counts.y), by = timepoint.x]
This only works if the number of observations between groups is the same. Anyway, the observations aren't matched, so using the temporary rep variable seems hacky.
For a more general case, where I need to pass vectors of the baseline values and the group's values to an arbitrary (more complicated) function, is there an idiomatic data.table or dplyr way of doing so with a grouped operation for all groups?
Here's the straightforward data.table approach (with f standing in for the question's NonsenseFunction):
my.dt[, f(counts, my.dt[timepoint==0, counts]), by=timepoint]
This probably grabs my.dt[timepoint==0, counts] again and again, for each group. You could instead save that value ahead of time:
v = my.dt[timepoint==0, counts]
my.dt[, f(counts, v), by=timepoint]
... or if you don't want to add v to the environment, maybe
with(list(v = my.dt[timepoint==0, counts]),
my.dt[, f(counts, v), by=timepoint]
)
You could supply the vector from the reference group (timepoint == 0) as the second argument, so it is used as a constant:
my.df %>%
group_by(timepoint) %>%
mutate(response = NonsenseFunction(counts, my.df$counts[my.df$timepoint == 0]))
Or if you want to make it beforehand:
constant <- my.df$counts[my.df$timepoint == 0]
my.df %>%
group_by(timepoint) %>%
mutate(response = NonsenseFunction(counts, constant))
You can try pre-computing the baseline mean in a new column; since NonsenseFunction only uses mean(y), this gives the same result as passing the full timepoint == 0 vector:
library(dplyr)
my.df %>%
mutate(new = mean(counts[timepoint == 0])) %>%
group_by(timepoint) %>%
summarise(result = NonsenseFunction(counts, new))
# A tibble: 3 × 2
# timepoint result
# <dbl> <dbl>
#1 0 0.0000000
#2 1 0.1398601
#3 2 0.2097902