Summing values when merging rows in a data set in R

So I have a large data set (50,000 rows and 500 columns). I merged the rows I wanted with this code:
Similarities <- Home %>%
group_by_at(c(1,2,5,9,70,26)) %>%
summarize_all(.funs = function(x) paste(unique(x), collapse = ','))
With this code, when combined rows have different values in a column, the output becomes a list separated with commas. However, now I want to sum all the values in one specific column, so I tried this code:
Similarities <- Home %>%
group_by_at(c(1,2,5,9,70,26)) %>%
summarize_at(.vars = FTR, .funs = function(x) paste(sum(x))),
summarize_all(.funs = function(x) paste(unique(x), collapse = ','))
I assumed it wouldn't work because I wasn't sure what I was doing.
My goal is that, when I merge rows together, all the values in the specific column "FTR" are added together.
An example of the data would be:
Total Type Clm FTR Loss
300 water 2 -103 N
200 fire 3 203 Y
300 water 2 100 Y
What my code does now is:
Total Type CLM FTR Loss
300 water 2 -103, 100 Y, N
200 fire 3 203 Y
But what I want is:
Total Type CLM FTR Loss
300 water 2 -3 Y, N
200 fire 3 203 Y

The following code sums the collapsed columns, like the question asks for.
special_sum <- function(x, sep = ", ", na.rm = TRUE){
  # sums each element of a character vector of comma-separated numbers
  f <- function(y, na.rm){
    y <- as.numeric(y)
    sum(y, na.rm = na.rm)
  }
  x <- as.character(x)
  x <- strsplit(x, sep)          # split each element on the separator
  sapply(x, f, na.rm = na.rm)    # one sum per original element
}
With the second data.frame posted in the question, the function special_sum could be called as follows. The group columns are for test purposes only.
Home <- read.table(text = "
Total Type CLM FTR Loss
300 water 2 '-103, 100' 'Y, N'
200 fire 3 203 Y
", header = TRUE)
Home %>%
group_by(1, 2) %>%
summarize_at(vars('FTR'), special_sum)
## A tibble: 2 x 3
## Groups: 1, 2 [1]
# `1` `2` FTR
# <dbl> <dbl> <dbl>
#1 1 2 -3
#2 1 2 203
Note that you should probably sum first then paste the values.
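Following up on that note, a minimal sketch of summing before pasting, assuming dplyr >= 1.0 and a toy Home data frame that mirrors the question's example (with the full 500-column data you would substitute the group_by_at(c(1, 2, 5, 9, 70, 26)) call from the question and collapse the remaining character columns, e.g. with across()):
library(dplyr)
# Toy data mirroring the question's example (hypothetical small version of Home)
Home <- data.frame(
  Total = c(300, 200, 300),
  Type  = c("water", "fire", "water"),
  Clm   = c(2, 3, 2),
  FTR   = c(-103, 203, 100),
  Loss  = c("N", "Y", "Y")
)
Home %>%
  group_by(Total, Type, Clm) %>%
  summarize(FTR  = sum(FTR),                                 # numeric sum, not paste
            Loss = paste(unique(Loss), collapse = ", "),     # collapse the rest
            .groups = "drop")
## A tibble: 2 x 5
#   Total Type    Clm   FTR Loss
#   <dbl> <chr> <dbl> <dbl> <chr>
# 1   200 fire      3   203 Y
# 2   300 water     2    -3 N, Y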

Related

Selecting 10 names based on 10 highest numbers of other column

I want to select the top 10 voted restaurants, and plot them together.
So I want to create a plot that shows the restaurant names and their votes.
I used:
topTenVotes <- top_n(dataSet, 10, Votes)
and it showed me all columns of the data set for the rows with the top 10 highest votes; however, I want just the number of votes and the restaurant names.
My question is: how do I select only the top 10 highest votes and their restaurant names, and plot them together?
expected output:
Restaurant Names Votes
A 300
B 250
C 230
D 220
E 210
F 205
G 200
H 194
I 160
J 120
K 34
And then a bar plot that shows these restaurant names and their votes.
Another simple approach with base functions is to create a named vector:
df <- data.frame(Names = LETTERS, Votes = sample(40:400, length(LETTERS)))
x <- df$Votes
names(x) <- df$Names # x <- setNames(df$Votes, df$Names) is another approach
barplot(sort(x, decreasing = TRUE)[1:10], xlab = "Restaurant Name", ylab = "Votes")
Or a one-line solution with base functions:
barplot(sort(xtabs(Votes ~ Names, df), decreasing = TRUE)[1:10], xlab = "Restaurant Names")
I'm not seeing a data set to use, so here's a minimal example to show how it might work:
library(tidyverse)
df <-
tibble(
restaurant = c("res1", "res2", "res3", "res4"),
votes = c(2, 5, 8, 6)
)
df %>%
arrange(-votes) %>%
head(3) %>%
ggplot(aes(x = reorder(restaurant, votes), y = votes)) +
geom_col() +
coord_flip()
The top_n command also works in this case but is designed for grouped data.
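For completeness, a quick sketch of that top_n route on the same toy df (a cutoff of 3 is used here only because the toy data has four rows):
df %>%
  top_n(3, votes) %>%
  ggplot(aes(x = reorder(restaurant, votes), y = votes)) +
  geom_col() +
  coord_flip()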
It's more efficient, though less readable, to use base functions:
#toy data
d <- data.frame(Names = sample(LETTERS, size = 15), value = rnorm(n = 15, mean = 25, sd = 10))
head(d)
Names value
1 D 25.592749
2 B 28.362303
3 H 1.576343
4 L 28.718517
5 S 27.648078
6 Y 29.364797
#reorder by, and retain, the top 10
newdata <- data.frame()
for (i in 1:10) {
  newdata <- rbind(newdata, d[which(d$value == sort(d$value, decreasing = T)[1:10][i]), ])
}
newdata
Names value
8 W 45.11330
13 K 36.50623
14 P 31.33122
15 T 30.28397
6 Y 29.36480
7 Q 29.29337
4 L 28.71852
10 Z 28.62501
2 B 28.36230
5 S 27.64808
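For what it's worth, the same selection can be written without a loop by ordering once and keeping the first ten rows; a sketch using the same toy d:
newdata <- head(d[order(d$value, decreasing = TRUE), ], 10)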

Using custom function to apply across multiple groups and subsets

I am having trouble applying a custom function to multiple groups within a data frame and adding the result back to the original data with mutate. I am trying to calculate the percent inhibition for each row of data (each observation in the experiment has a value). The challenging issue is that the function needs the mean of two different groups of values (positive and negative controls) and then uses that mean value in each calculation.
In other words, the experimental value is subtracted from the mean of the negative control, then divided by the mean of the negative control minus the mean of the positive control.
Each observation, including the + and - controls, should have a calculated percent inhibition, and as a double check, for each experiment (grouping) the
mean of the pct inhibition of the - controls should be around 0 and the + controls around 100.
The function:
percent_inhibition <- function(uninhibited, inhibited, unknown){
  uninhibited <- as.vector(uninhibited)
  inhibited <- as.vector(inhibited)
  unknown <- as.vector(unknown)
  mu_u <- mean(uninhibited, na.rm = TRUE)  # mean of the negative (uninhibited) controls
  mu_i <- mean(inhibited, na.rm = TRUE)    # mean of the positive (inhibited) controls
  percent_inhibition <- (mu_u - unknown)/(mu_u - mu_i)*100
  return(percent_inhibition)
}
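To make the formula concrete, a quick check with made-up control values (the numbers are purely illustrative): if the uninhibited (negative control) values average 100 and the inhibited (positive control) values average 0, an unknown of 50 should come out as 50% inhibited.
percent_inhibition(uninhibited = c(98, 102), inhibited = c(-1, 1), unknown = 50)
# [1] 50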
I have a data frame with multiple variables: target, box, replicate, and sample type. I am able to do the calculation by subsetting the data (below, to 1 target, box, and replicate) but have not been able to figure out the right way to apply it to all of the data.
subset <- data %>%
  filter(target == "A", box == "1", replicate == 1)
uninhib <- subset$value[subset$sample == "uninhib"]
inhib <- subset$value[subset$sample == "inhib"]
pct <- subset %>%
  mutate(pct = percent_inhibition(uninhib, inhib, .$value))
I have tried group_by with do, and nest functions, but my knowledge is lacking in how to apply these functions to my subsetting problem. I'm stuck when it comes to the subset of the subset (calculating the means) and then applying that to the individual values. I am hoping there is an elegant way to do this without all of the subsetting, but I am at a loss as to how.
I have tried:
inhibition <- data %>%
group_by(target, box, replicate) %>%
mutate(pct = (percent_inhibition(.$value[.$sample == "uninhib"], .$value[.$sample == "inhib"], .$value)))
But I get the error that the columns are not the right length, because of the group_by function.
library(tidyr)
library(purrr)
library(dplyr)
data %>%
group_by(target, box, replicate) %>%
mutate(pct = {
x <- split(value, sample)
percent_inhibition(x$uninhib, x$inhib, value)
})
#> # A tibble: 10,000 x 6
#> # Groups: target, box, replicate [27]
#> target box replicate sample value pct
#> <chr> <chr> <int> <chr> <dbl> <dbl>
#> 1 A 1 3 inhib -0.836 1941.
#> 2 C 1 1 uninhib -0.221 -281.
#> 3 B 3 2 inhib -2.10 1547.
#> 4 C 1 1 uninhib -1.67 -3081.
#> 5 C 1 3 inhib -1.10 -1017.
#> 6 A 2 1 inhib -1.67 906.
#> 7 B 3 1 uninhib -0.0495 -57.3
#> 8 C 3 2 inhib 1.56 5469.
#> 9 B 3 2 uninhib -0.405 321.
#> 10 B 1 2 inhib 0.786 -3471.
#> # … with 9,990 more rows
Created on 2019-03-25 by the reprex package (v0.2.1)
Or:
data %>%
group_by(target, box, replicate) %>%
mutate(pct = percent_inhibition(value[sample == "uninhib"],
value[sample == "inhib"], value))
With data as:
n <- 10000L
set.seed(123) ; data <-
tibble(
target = sample(LETTERS[1:3], n, replace = TRUE),
box = sample(as.character(1:3), n, replace = TRUE),
replicate = sample(1:3, n, replace = TRUE),
sample = sample(c("inhib", "uninhib"), n, replace = TRUE),
value = rnorm(n)
)
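As a double check of the kind described in the question, you can summarize the per-group control means after computing pct; a sketch on the simulated data above (dplyr >= 1.0 assumed for .groups). By construction, the uninhib rows average 0 and the inhib rows average 100 within each group:
data %>%
  group_by(target, box, replicate) %>%
  mutate(pct = percent_inhibition(value[sample == "uninhib"],
                                  value[sample == "inhib"], value)) %>%
  group_by(target, box, replicate, sample) %>%
  summarize(mean_pct = mean(pct), .groups = "drop")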

R: Summing of the column values by ranged values of another column

Good day!
I've got a table of two columns. In the first column (x) there are values which I want to divide into categories according to a specified range of values (in my instance, 300). Then, using these categories, I want to sum the values in another column (v). For instance, using my test data, the first category is from 65100 to 65400, the next from 65400 to 65700, and so on.
The result should be a table of two columns: the first contains the categories of x; the second contains the sums of the corresponding values of v.
Thank you!!!
# data
set.seed(1)
x <- sample(seq(65100, 67900, by=5), 100, replace = TRUE)
v <- sample(seq(1000, 8000), 100, replace = TRUE)
tabl <- data.frame(x=c(x), v=c(v))
attach(tabl)
#categories
seq(((min(x) - min(x)%%300) + 300), ((max(x) - max(x)%%300) + 300), by =300)
I understood you want to cut vector x using pre-calculated cut-off thresholds, then compute sums over vector v using those groupings.
This is one line of code with data.table and chaining. Your data are in a data.table named DT.
DT[, CUT := cut(x, breaks)][, sum(v), by=CUT]
Explanation:
First, assign cut-offs to variable breaks like so.
breaks <- seq(((min(x) - min(x) %% 300) + 300), ((max(x) - max(x) %% 300) + 300), by =300)
Second, compute a new column CUT to group rows by the data in breaks.
DT[, CUT := cut(x, breaks)]
Third, sum on column v in groups, using by=. I have chained this operation with the previous.
DT[, CUT := cut(x, breaks)][, sum(v), by=CUT]
Convert your data.frame to data.table like so.
library(data.table)
DT <- as.data.table(tabl)
This is the final result:
CUT V1
1: (6.57e+04,6.6e+04] 45493
2: (6.6e+04,6.63e+04] 77865
3: (6.66e+04,6.69e+04] 22893
4: (6.75e+04,6.78e+04] 61738
5: (6.54e+04,6.57e+04] 44805
6: (6.69e+04,6.72e+04] 64079
7: NA 33234
8: (6.72e+04,6.75e+04] 66517
9: (6.63e+04,6.66e+04] 43887
10: (6.78e+04,6.81e+04] 172
You can dress this up to improve aesthetics. For example, you can reset the factor levels for ease of reading.
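Along those lines, a sketch of one such dressing-up (this tweaks the cut() call rather than the factor levels), reusing DT and breaks from above: prepending the minimum of x keeps the lowest values out of the NA bin, and dig.lab prints the break points without scientific notation.
breaks2 <- c(min(DT$x), breaks)
DT[, CUT := cut(x, breaks2, include.lowest = TRUE, dig.lab = 6)][, .(total_v = sum(v)), keyby = CUT]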
When I use dplyr, I usually do it like this, although I like the cut solution too.
# data
set.seed(1)
x <- sample(seq(65100, 67900, by=5), 100, replace = TRUE)
v <- sample(seq(1000, 8000), 100, replace = TRUE)
tabl <- data.frame(group=c(x), value=c(v))
attach(tabl)
#categories
s <- seq(((min(x) - min(x)%%300) + 300), ((max(x) - max(x)%%300) + 300), by =300)
tabl %>% rowwise() %>% mutate(g = s[min(which(group < s), na.rm=T)]) %>% ungroup() %>%
group_by(g) %>% summarise(sumvalue = sum(value))
result:
g sumvalue
<dbl> <int>
65400 28552
65700 49487
66000 45493
66300 77865
66600 43887
66900 21187
67200 65785
67500 66517
67800 61738
68100 1722
Try this (no package needed):
s <- seq(65100, max(tabl$x)+300, 300)
tabl$col = as.vector(cut(tabl$x, breaks = s, labels = 1:10))
df <- aggregate(v~col, tabl, sum)
# col v
# 1 1 33234
# 2 2 44805
# 3 3 45493
# 4 4 77865
# 5 5 43887
# 6 6 22893
# 7 7 64079
# 8 8 66517
# 9 9 61738
# 10 10 1722

R - How to change values with decimals into a different form

I have a variable that currently takes values of the form x.1, x.2, or x.3, where x is any number before the decimal point.
I would like to convert x.1 to x.333, x.2 to x.666, and x.3 to x.999 (or, in this case, I would assume it would be rounded up to the whole number).
Context: running regression analysis containing a variable of innings pitched (baseball pitchers) which currently have data values of the .1, .2, .3 form above.
Help would be much appreciated!
You can use x %% 1 to get the fractional part of a number in R. Then just multiply that by 3.333 and add the result back on to the integer part of your number to get total innings pitched.
x <- 2.3
as.integer(x) + (x %% 1 * 3.333)
[1] 2.9999
(Use 3.333 instead of 0.333 to move the decimal.)
Depending on the exact context, it could be nice to keep the component parts -- if that's the case, I would be a little verbose and utilize tidyr and dplyr:
library(tidyr)
library(dplyr)
vec <- c("123.1", "456.2", "789.3")
df <- data.frame(vec)
df %>%
separate(vec, into = c("before_dot", "after_dot"), remove = FALSE, convert = TRUE) %>%
mutate(after_dot_times_333 = after_dot * 333,
new_var = paste(before_dot, after_dot_times_333, sep = "."))
# vec before_dot after_dot after_dot_times_333 new_var
# 1 123.1 123 1 333 123.333
# 2 456.2 456 2 666 456.666
# 3 789.3 789 3 999 789.999
Alternatively, you could accomplish this in one line:
sapply(strsplit(vec, "\\."), function(x) paste(x[1], as.numeric(x[2]) * 333, sep = "."))
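If the variable is going into a regression, a fully numeric conversion may be more useful than a pasted string; a sketch assuming the innings-pitched convention that .1, .2, and .3 mean one, two, and three thirds of an inning:
ip <- c(123.1, 456.2, 789.3)
floor(ip) + round(ip %% 1 * 10) / 3   # whole innings plus thirds
# [1] 123.3333 456.6667 790.0000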

Calculate the mean per subject and repeat the value for each subject's row

This is the first time that I have asked a question on Stack Overflow. I have tried searching for the answer but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observations. Basically, I have 83 subjects, and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64 observations).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate (RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate (RT ~ su, data, mean)
as R returns this error:
Error in $<-.data.frame(tmp, "mean", value = list(su = 1:83, RT
= c(378.1328125, : replacement has 83 rows, data has 20416
I understand that the formula lacks a command specifying that the mean for each subject has to be repeated for all the subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows, if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
ID = sample(letters[1:5], size = 14, replace = TRUE),
Value = rnorm(14))[order(ID)]
Staying in Base R, ave is intended for this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
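A minimal sketch on made-up su/RT values showing that ave() returns one value per row, repeated within each subject:
toy <- data.frame(su = c(1, 1, 2), RT = c(300, 320, 400))
toy$mean <- with(toy, ave(x = RT, su, FUN = mean))
toy
#   su  RT mean
# 1  1 300  310
# 2  1 320  310
# 3  2 400  400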
Simply merge your aggregated means data with the full data frame, joined by the subject:
aggdf <- aggregate (RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
data <- merge(data, aggdf, by="su")
Another compelling way of handling this without generating extra data objects is by using group_by from the dplyr package:
# Generating some data
data <- data.table::data.table(
su = sample(letters[1:5], size = 14, replace = TRUE),
RT = rnorm(14))[order(su)]
# Performing
> data %>% group_by(su) %>%
+ mutate(Mean = mean(RT)) %>%
+ ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978
