Having a dataframe like this
data.frame(id = c(1,2), num = c("30, 4, -2,","10, 20"))
How can I sum the numbers in every row of the column num, taking the minus signs into account?
Example of expected output:
data.frame(id = c(1,2), sum = c(32, 30))
Using Base R you could do the following:
# data
df <- data.frame(id = c(1,2), num = c("30, 4, -2,","10, 20"))
# split by ",", convert to numeric and then sum
df[, 2] <- sapply(strsplit(as.character(df$num), ","), function(x){
sum(as.numeric(x))
})
# result
df
# id num
# 1 1 32
# 2 2 30
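If the strings may contain stray whitespace or empty tokens (for example a doubled or leading comma), a slightly more defensive variant could look like the sketch below. The helper name sum_csv is hypothetical and not part of the answer above; it assumes that every non-empty token is a valid number.

```r
# Hypothetical helper: sum the comma-separated numbers in each string.
sum_csv <- function(x) {
  tokens <- strsplit(as.character(x), ",", fixed = TRUE)
  vapply(tokens, function(tok) {
    tok <- trimws(tok)        # drop surrounding whitespace
    tok <- tok[nzchar(tok)]   # drop empty tokens such as those from ",,"
    sum(as.numeric(tok))
  }, numeric(1), USE.NAMES = FALSE)
}

df <- data.frame(id = c(1, 2), num = c("30, 4, -2,", "10, 20"))
df$sum <- sum_csv(df$num)
df
```

Using vapply instead of sapply fixes the return type to one number per row, which avoids surprises if the input is empty.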
If you can use packages, the tidyverse packages make this easy. They follow tidy data principles, which are quick and easy to apply once you get used to thinking this way.
library(tidyr)
library(dplyr)
df %>%
# Convert the string of numbers to a tidy dataframe
# with one number per row with the id column for grouping
separate_rows(num,sep = ",") %>%
# Convert the text to a number so we can sum
mutate(num = as.numeric(num)) %>%
# Perform the calculation for each id
group_by(id) %>%
# Sum the number
summarise(sum = sum(num,na.rm = TRUE)) %>%
# Ungroup for further use of the data
ungroup()
# A tibble: 2 x 2
# id sum
# <dbl> <dbl>
# 1 1 32
# 2 2 30
library(stringr)
df <- data.frame(id = c(1,2), num = c("30, 4, -2","10, 20"))
df$sum <- NA
for (i in 1:nrow(df)) {
temp <- as.character(df[i,2])
n_num <- str_count(temp, '[0-9.]+')
total <- 0
for (j in 1:n_num) {
digit <- strsplit(temp, ',')[[1]][j]
total <- total + as.numeric(digit)
temp <- sub(digit, '', temp)
}
df[i, 'sum'] <- total
}
print(df)
id num sum
1 1 30, 4, -2 32
2 2 10, 20 30
I am trying to run some simulations in R and I am stuck on the loop I need to write. I can get what I need for one iteration, but coding the loop is throwing me off. This is what I am doing for one iteration.
Subjects <- c(1,2,3,4,5,6)
Group <- c('A','A','B','B','C','C')
Score <- rnorm(6,mean=5,sd=1)
Example <- data.frame(Subjects,Group,Score)
library(dplyr)
Score_by_Group <- Example %>% group_by(Group) %>% summarise(SumGroup = sum(Score))
Score_by_Group$Top_Group <- ifelse(Score_by_Group[,2] == max(Score_by_Group[,2]),1,0)
Group SumGroup Top_Group
1 A 8.77 0
2 B 6.22 0
3 C 9.38 1
What I need my loop to do is run the above 'X' times and, each time, add 1 to the running total of whichever group has the top score. So for example, with x = 10, I would need a result like this:
Group Top_Group
1 A 3
2 B 5
3 C 2
If you don't mind forgoing the for loop, we can use replicate to repeat the code, then bind the output together, and then summarize.
library(tidyverse)
run_sim <- function()
{
Subjects <- c(1, 2, 3, 4, 5, 6)
Group <- c('A', 'A', 'B', 'B', 'C', 'C')
Score <- rnorm(6, mean = 5, sd = 1)
Example <- data.frame(Subjects, Group, Score)
Score_by_Group <- Example %>%
group_by(Group) %>%
summarise(SumGroup = sum(Score)) %>%
mutate(Top_Group = +(SumGroup == max(SumGroup))) %>%
select(-SumGroup)
}
results <- bind_rows(replicate(10, run_sim(), simplify = FALSE)) %>%
group_by(Group) %>%
summarise(Top_Group = sum(Top_Group))
Output
Group Top_Group
<chr> <int>
1 A 3
2 B 3
3 C 4
I think this should work:
library(dplyr)
Subjects <- c(1,2,3,4,5,6)
Group <- c('A','A','B','B','C','C')
Groups <- c('A','B','C')
Top_Group <- c(0,0,0)
x <- 10
for(i in 1:x) {
Score <- rnorm(6,mean=5,sd=1)
Example <- data.frame(Subjects,Group,Score)
Score_by_Group <- Example %>% group_by(Group) %>% summarise(SumGroup = sum(Score))
Score_by_Group$Top_Group <- ifelse(Score_by_Group[,2] == max(Score_by_Group[,2]),1,0)
Top_Group <- Top_Group + Score_by_Group$Top_Group
}
tibble(Groups, Top_Group)
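For comparison, here is a base R sketch of the same simulation that records only the winning group in each run and then tabulates the winners. The function name count_top_groups and the seed are assumptions made for illustration; the group layout and the rnorm parameters are taken from the question.

```r
# Hypothetical helper: run the simulation n_sims times and count
# how often each group has the highest summed score.
count_top_groups <- function(n_sims) {
  groups <- c("A", "A", "B", "B", "C", "C")
  winners <- replicate(n_sims, {
    score <- rnorm(6, mean = 5, sd = 1)
    sums <- tapply(score, groups, sum)   # summed score per group
    names(sums)[which.max(sums)]         # name of the top group
  })
  # Fixing the factor levels keeps groups that never win, with a count of 0
  table(factor(winners, levels = unique(groups)))
}

set.seed(42)  # arbitrary seed, only for reproducibility
res <- count_top_groups(10)
res
```

Note that which.max picks the first group in case of an exact tie, which is practically impossible with continuous scores.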
Let's say I make a dummy dataframe with 6 columns with 10 observations:
X <- data.frame(a=1:10, b=11:20, c=21:30, d=31:40, e=41:50, f=51:60)
I need to create a loop that evaluates 3 columns at a time, adding the summed second and third columns and dividing this by the sum of the first column:
(sum(b)+sum(c))/sum(a) ... (sum(e)+sum(f))/sum(d) ...
I then need to construct a final dataframe from these values. For example using the dummy dataframe above, it would look like:
value
1. 7.454545
2. 2.84507
I imagine I need to use the next function to iterate within the loop, but I'm fairly lost! Thank you for any help.
You can split your data frame into groups of 3 by creating a vector with rep where each element repeats 3 times. Then with this list of sub data frames, (s)apply the function of summing the second and third columns, adding them, and dividing by the sum of the first column.
out_vec <-
sapply(
split.default(X, rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (sum(x[2]) + sum(x[3]))/sum(x[1]))
data.frame(value = out_vec)
# value
# 1 7.454545
# 2 2.845070
You could also sum all the columns up front before the sapply with colSums, which will be more efficient.
out_vec <-
sapply(
split(colSums(X), rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (x[2] + x[3])/x[1])
data.frame(value = out_vec, row.names = NULL)
# value
# 1 7.454545
# 2 2.845070
You could use tapply:
tapply(colSums(X), gl(ncol(X)/3, 3), function(x)sum(x[-1])/x[1])
1 2
7.454545 2.845070
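Since the question asked for a loop, the same calculation can also be sketched with an explicit for loop that steps through the columns three at a time. This assumes the number of columns is a multiple of 3; the variable names starts and value are only illustrative.

```r
X <- data.frame(a = 1:10, b = 11:20, c = 21:30, d = 31:40, e = 41:50, f = 51:60)

# Index of the first column of each block of three
starts <- seq(1, ncol(X), by = 3)
value <- numeric(length(starts))

for (k in seq_along(starts)) {
  i <- starts[k]
  # (sum of 2nd column + sum of 3rd column) / sum of 1st column of the block
  value[k] <- (sum(X[[i + 1]]) + sum(X[[i + 2]])) / sum(X[[i]])
}

result <- data.frame(value = value)
result
```

For the dummy data the column sums are 55, 155, 255, 355, 455 and 555, so the two values are 410/55 and 1010/355.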
Here is an option with tidyverse
library(dplyr) # 1.0.0
library(tidyr)
X %>%
summarise(across(everything(), sum)) %>%
pivot_longer(everything()) %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
summarise(value = sum(lead(value)/first(value), na.rm = TRUE)) %>%
select(value)
# A tibble: 2 x 1
# value
# <dbl>
#1 7.45
#2 2.85
I have a data frame that looks like this:
library(dplyr)
# Set RNG
set.seed(33550336)
# Create toy data frame
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df1 <- df %>% mutate(region = "Here")
df2 <- df %>% mutate(region = "There")
df3 <- df %>% mutate(region = "Everywhere")
df_ref <- do.call(rbind, list(df1, df2, df3))
df_ref$value <- runif(nrow(df_ref))
# > head(df_ref)
# day dist region value
# 1 1 0 Here 0.39413117
# 2 2 0 Here 0.44224203
# 3 3 0 Here 0.44207487
# 4 4 0 Here 0.08007335
# 5 5 0 Here 0.02836093
# 6 6 0 Here 0.94475814
This represents a reference data frame and I'd like to compare observations against it. My observations are taken on a specific day that is found in this reference data frame (i.e., day is an integer from 1 to 10) in a region that is also found in this data frame (i.e., Here, There, or Everywhere), but the distance (dist) is not necessarily an integer between 0 and 100. For example, my observation data frame (df_obs) might look like this:
# Observations
df_obs <- data.frame(day = sample(1:10, 3, replace = TRUE),
region = sample(c("Here", "There", "Everywhere")),
dist = runif(3, 0, 100))
# day region dist
# 1 6 Everywhere 68.77991
# 2 7 There 57.78280
# 3 10 Here 85.71628
Since dist is not necessarily one of the reference distances, I can't just look up the value corresponding to my observations in df_ref like this:
df_ref %>% filter(day == 6, region == "Everywhere", dist == 68.77991)
So, I created a lookup function that uses the linear interpolation function approx:
lookup <- function(re, di, da){
# Filter to day and region
df_tmp <- df_ref %>% filter(region == re, day == da)
# Approximate answer from distance
approx(unlist(df_tmp$dist), unlist(df_tmp$value), xout = di)$y
}
Applying this to my first observation gives,
lookup("Everywhere", 68.77991, 6)
#[1] 0.8037013
Nevertheless, when I apply the function using mutate I get a different answer.
df_obs %>% mutate(ref = lookup(region, dist, day))
# day region dist ref
# 1 6 Everywhere 68.77991 0.1881132
# 2 7 There 57.78280 0.1755198
# 3 10 Here 85.71628 0.1730285
I suspect that this is because lookup is not vectorised correctly. Why am I getting different answers and how do I fix my lookup function to avoid this?
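A likely explanation, offered as a sketch rather than a definitive answer: inside mutate the function receives whole columns, so region == re and day == da become element-wise comparisons with recycling instead of selecting one region and one day, and approx then interpolates over a mixture of rows. One way around this is to call the function once per row, for example with mapply. The version below is written in base R so that it is self-contained; lookup_one is a hypothetical name, and the observation values are copied from the example above.

```r
set.seed(33550336)
# Reference data: every combination of day, dist and region
df_ref <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10),
                      region = c("Here", "There", "Everywhere"),
                      stringsAsFactors = FALSE)
df_ref$value <- runif(nrow(df_ref))

# Scalar lookup: expects exactly one region, one distance and one day
lookup_one <- function(re, di, da) {
  ref <- df_ref[df_ref$region == re & df_ref$day == da, ]
  approx(ref$dist, ref$value, xout = di)$y
}

df_obs <- data.frame(day = c(6, 7, 10),
                     region = c("Everywhere", "There", "Here"),
                     dist = c(68.77991, 57.78280, 85.71628),
                     stringsAsFactors = FALSE)

# Apply the scalar lookup row by row
df_obs$ref <- mapply(lookup_one, df_obs$region, df_obs$dist, df_obs$day,
                     USE.NAMES = FALSE)
df_obs
```

With dplyr, the same idea could be expressed by wrapping the call in mapply inside mutate, or by using rowwise() before mutate.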
I have the following data:
set.seed(26312)
id <- rep(c(1, 2, 3, 4, 5), each = 9)
wrc <- round(runif(36, 20, 100))
wrc <- c(wrc, wrc[10:18])
x <- rep(1:9, 5)
dat <- data.frame(id, wrc, x)
In this data set, id 2 and id 5 contain the exact same data but with different IDs. This can be verified by running,
dat[dat$id == 2, ]
dat[dat$id == 5, ]
I have a much larger data set, with 4321 IDs, and I want to remove these duplicates because even though they have different IDs, they really are duplicates.
Presently I am using a combination of really awful and extremely slow for() and while() loops. In English, the code subsets one id and then compares it, inside a while loop, to every other id. When it finds a duplicate, meaning all the rows of data are identical, it should throw away the first id of the duplicated pair. The resulting cleaned_data is what I want; it is just unbearably slow to get there. One comparison takes roughly 1 minute when I have 4321 ids, so this awful loop needs about 4321 minutes to run. Can someone help?
library("dplyr")
id_check = 1:5
cleaned_data <- data.frame()
for(i in id_check){
compare_tmp <- dat %>% filter(id == i)
compare_check <- compare_tmp %>% select(wrc, x)
duplicate = FALSE
if(i == length(id_check)){
cleaned_data <- rbind(cleaned_data, compare_tmp)
break
} else {
id_tmp = i + 1
}
while(duplicate == FALSE){
check <- dat %>% filter(id == id_tmp) %>% select(wrc, x)
if(nrow(check) == 0) break
duplicate = identical(compare_check, check)
id_tmp = id_tmp + 1
if(id_tmp == (length(id_check) + 1)) {
break
}
}
if(duplicate == FALSE){
cleaned_data <- rbind(cleaned_data, compare_tmp)
}
}
cleaned_data
This is in response to why duplicated won't work. Below, ids 2 and 5 are not the same subjects because their data aren't always identical.
set.seed(26312)
id <- rep(c(1, 2, 3, 4, 5), each = 9)
wrc <- round(runif(36, 20, 100))
wrc <- c(wrc, wrc[c(1, 11:18)])
x <- rep(1:9, 5)
dat <- data.frame(id, wrc, x)
dat[dat$id == 2,]
dat[dat$id == 5,]
If I run dat[!duplicated(dat[2:3]),] it removes id 5, when it shouldn't.
If the column structure is accurate, you could convert to wide format for duplicate detection:
dat_wide = reshape2::dcast(dat, id ~ x, value.var = "wrc")
dupes = dat_wide$id[duplicated(dat_wide[-1], fromLast = TRUE)]
no_dupes = dat[!dat$id %in% dupes, ]
Maybe something along the lines of:
do.call(
rbind,
split(dat, dat$id)[!duplicated(lapply(split(dat[2:3], dat$id), `rownames<-`, NULL), fromLast = TRUE)]
)
This splits by id, identifies duplicates, then binds again the non-duplicates.
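The same idea can be sketched without binding data frames back together: build one string key per id from its rows and run duplicated on the keys. This is only an illustration under the assumption that rows are in the same order (by x) within every id; the names keys, dupes and cleaned are hypothetical.

```r
set.seed(26312)
id <- rep(c(1, 2, 3, 4, 5), each = 9)
wrc <- round(runif(36, 20, 100))
wrc <- c(wrc, wrc[10:18])          # id 5 repeats the data of id 2
x <- rep(1:9, 5)
dat <- data.frame(id, wrc, x)

# One key per id: all (wrc, x) pairs pasted into a single string
keys <- vapply(split(dat[c("wrc", "x")], dat$id),
               function(d) paste(d$wrc, d$x, collapse = "|"),
               character(1))

# fromLast = TRUE flags the first id of each duplicated set, as in the question
dupes <- names(keys)[duplicated(keys, fromLast = TRUE)]
cleaned <- dat[!as.character(dat$id) %in% dupes, ]
unique(cleaned$id)
```

Because a key is identical only when every row matches, an id that differs in a single value is kept.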
Edit
Since time is of the essence here, I ran a benchmark of the solutions so far:
set.seed(26312)
p <- 4321
id <- rep(1:p, each = 9)
dats <- replicate(p %/% 2, round(runif(9, 20, 100)), simplify = FALSE)
wrc <- unlist(sample(dats, p, replace = TRUE))
x <- rep(1:9, times = p)
dat <- data.frame(id, wrc, x)
microbenchmark::microbenchmark(
base = {
do.call(
rbind,
split(dat, dat$id)[!duplicated(lapply(split(dat[2:3], dat$id), `rownames<-`, NULL), fromLast = TRUE)]
)
},
tidyr = {
as_tibble(dat) %>%
nest(-id) %>%
filter(!duplicated(data, fromLast = TRUE)) %>%
unnest()
},
reshape = {
dat_wide = reshape2::dcast(dat, id ~ x, value.var = "wrc")
dupes = dat_wide$id[duplicated(dat_wide[-1], fromLast = T)]
no_dupes = dat[!dat$id %in% dupes, ]
},
times = 10L
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# base 892.8239 980.36553 1090.87505 1096.12514 1187.98810 1232.47244 10 c
# tidyr 944.8156 953.10558 977.71756 976.83703 990.58672 1033.27664 10 b
# reshape 49.9955 50.13347 52.20539 51.91833 53.91568 55.64506 10 a
With tidyr:
library(tidyr)
library(dplyr)
as_tibble(dat) %>%
nest(data = -id) %>%
filter(!duplicated(data, fromLast = TRUE)) %>%
unnest(cols = data)
# # A tibble: 36 x 3
# id wrc x
# <dbl> <dbl> <int>
# 1 1 53 1
# 2 1 44 2
# 3 1 70 3
# 4 1 31 4
# 5 1 67 5
# 6 1 50 6
# 7 1 70 7
# 8 1 40 8
# 9 1 52 9
# 10 3 95 1
# # ... with 26 more rows
(Note: I'm not sure about the Stack Overflow policy on multiple answers, but this one is different enough to deserve a separate answer IMHO. If it's not, please say so and I'll edit my initial answer and delete this one.)
This question is similar to an earlier one about selecting the top N values within a group by column.
However, I want to select the last N values by group, with N depending on the value of a corresponding count column. The count represents the number of occurrences of a specific name. If the count is 3 or more, I want only the last three entries, but if it is less than 3, I want only the last entry.
library(dplyr)
# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"), Value = c(1,2,3,4,5,6,7,8,9))
# Obtain count for each name
count <- df %>%
group_by(Name) %>%
summarise(Count = n_distinct(Value))
# Merge dataframe with count
merge(df, count, by=c("Name"))
# Delete the first entry for x and the first entry for z
# Desired output
data.frame(Name = c("x","x","x","y","y","y","z"), Value = c(2,3,4,5,6,7,9))
Another dplyrish way:
df %>% group_by(Name) %>% slice(tail(row_number(),
if (n_distinct(Value) < 3) 1 else 3
))
# A tibble: 7 x 2
# Groups: Name [3]
Name Value
<fctr> <dbl>
1 x 2
2 x 3
3 x 4
4 y 5
5 y 6
6 y 7
7 z 9
The analogue in data.table is...
library(data.table)
setDT(df)
df[, tail(.SD, if (uniqueN(Value) < 3) 1 else 3), by=Name]
The closest thing in base R is...
with(df, {
len = tapply(Value, Name, FUN = length)
nv = tapply(Value, Name, FUN = function(x) length(unique(x)))
df[ sequence(len) > rep(nv - ifelse(nv < 3, 1, 3), len), ]
})
... which is way more difficult to come up with than it should be.
Another possibility:
library(tidyverse)
df %>%
split(.$Name) %>%
map_df(~ if (n_distinct(.x) >= 3) tail(.x, 3) else tail(.x, 1))
Which gives:
# Name Value
#1 x 2
#2 x 3
#3 x 4
#4 y 5
#5 y 6
#6 y 7
#7 z 9
In base R, split the df by df$Name first. Then, for each subgroup, check number of rows and extract last 3 or last 1 row conditionally.
do.call(rbind, lapply(split(df, df$Name), function(a)
a[tail(sequence(NROW(a)), c(3,1)[(NROW(a) < 3) + 1]),]))
Or
do.call(rbind, lapply(split(df, df$Name), function(a)
a[tail(sequence(NROW(a)), ifelse(NROW(a) < 3, 1, 3)),]))
# Name Value
#x.2 x 2
#x.3 x 3
#x.4 x 4
#y.5 y 5
#y.6 y 6
#y.7 y 7
#z z 9
For three conditional values:
do.call(rbind, lapply(split(df, df$Name), function(a)
a[tail(sequence(NROW(a)), ifelse(NROW(a) >= 6, 6, ifelse(NROW(a) >= 3, 3, 1))),]))
If you're already using dplyr, the natural approach is:
library(dplyr)
# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"),
Value = c(1,2,3,4,5,6,7,8,9))
df %>%
group_by(Name) %>%
mutate(Count = n_distinct(Value),
Rank = dense_rank(desc(Value))) %>%
filter((Count>= 3 & Rank <= 3) | (Rank==1)) %>%
select(-c(Count,Rank))
There's no need for a merge since you are just counting and ranking on groups defined by Name. Then, you apply a filter on your count and rank requirements, and (optionally, for clean-up) drop the counts and ranks.
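Ranking by Value gives the last rows only when values increase within each group. If "last" should mean position in the data instead, a base R sketch with ave could look like this. Note the assumptions: it uses the group size rather than the number of distinct values as the count, and the names n_in_group and pos_from_end are illustrative only.

```r
df <- data.frame(Name = c("x", "x", "x", "x", "y", "y", "y", "z", "z"),
                 Value = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

idx <- seq_len(nrow(df))
# Number of rows in each row's group
n_in_group <- ave(idx, df$Name, FUN = length)
# Position counted from the end of the group (last row = 1)
pos_from_end <- ave(idx, df$Name, FUN = function(i) rev(seq_along(i)))

# Keep the last 3 rows for groups with at least 3 rows, otherwise the last row
keep <- pos_from_end <= ifelse(n_in_group >= 3, 3, 1)
result <- df[keep, ]
result
```

For the sample data this keeps values 2, 3, 4 for x, all of y, and 9 for z, matching the desired output in the question.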