I have a data frame like
V1 V2 V3
1 1 1 2
2 0 1 0
3 3 0 3
....
and I have a vector of the same length as the number of rows in the data frame (it's the cluster from kmeans, if that matters)
[1] 2 2 1...
From those I can get the colMeans for each cluster, like
cm1 <- colMeans(df[fit$cluster==1,])
cm2 <- colMeans(df[fit$cluster==2,])
(I don't think I should do that part explicitly, but that's how I'm thinking about the problem.)
What I want is, for each column of the data frame, the cluster value for which that column's mean is the maximum. I'd also like to get (separately is fine) the second-highest, third-highest, etc. So in the example I would want the output to be a vector with one element for each column of the data frame:
1 2 1...
because for the first column of the data frame, the column mean for the first cluster is 3, while the column mean for the second cluster is 0.5.
If the cluster vector has the same length as the number of rows of 'df', split the data by 'fit$cluster' into a list and get the column means of each subset in long format,
lst1 <- lapply(split(df, fit$cluster), function(x) stack(as.list(colMeans(x))))
dat <- do.call(rbind, Map(cbind, cluster = names(lst1), lst1))
aggregate(values ~ ind, dat, FUN = which.max)
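If the second- and third-highest clusters are needed as well, the same long-format 'dat' can be ranked per column in base R; a minimal sketch (my addition, assuming 'dat' built as above):
# order the cluster labels within each column by descending mean
ranked <- lapply(split(dat, dat$ind), function(x) x$cluster[order(-x$values)])
sapply(ranked, `[`, 1)  # highest-mean cluster per column (use 2 for the second-highest)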
If we need to extract multiple elements based on column means: create the 'cluster' column in the data, reshape to 'long' format (or use summarise/across), group by 'cluster' and 'name', get the mean of 'value', arrange by 'name' and by descending 'value', then return the top n rows per column with slice_head.
library(dplyr)
library(tidyr)
df %>%
  mutate(cluster = fit$cluster) %>%
  pivot_longer(cols = -cluster) %>%
  group_by(cluster, name) %>%
  summarise(value = mean(value), .groups = 'drop') %>%
  arrange(name, desc(value)) %>%
  group_by(name) %>%
  slice_head(n = 2)
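To collapse this to the single vector asked for in the question (one cluster per column), one possible follow-up, my own sketch rather than part of the answer above, is to keep only the top row per column with slice_max and pull out the cluster labels:
top1 <- df %>%
  mutate(cluster = fit$cluster) %>%
  pivot_longer(cols = -cluster) %>%
  group_by(cluster, name) %>%
  summarise(value = mean(value), .groups = 'drop') %>%
  group_by(name) %>%
  slice_max(value, n = 1, with_ties = FALSE)  # keep the cluster with the highest mean
setNames(top1$cluster, top1$name)             # V1 = 1, V2 = 2, V3 = 1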
data
df <- structure(list(V1 = c(1L, 0L, 3L), V2 = c(1L, 1L, 0L), V3 = c(2L,
0L, 3L)), class = "data.frame", row.names = c("1", "2", "3"))
fit <- structure(list(cluster = c(2, 2, 1)), class = "data.frame", row.names = c(NA, -3L))
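As a final aside, a compact base R one-liner for the single best cluster per column (my own sketch, using the df and fit objects above):
sapply(df, function(col) names(which.max(tapply(col, fit$cluster, mean))))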
I have made a tibble with three columns. Cancer status (case/control), trait A (x1) and trait B (x2).
cancer <- tibble(case = factor(c(rep(1,3), rep(0,3)),
                               levels = c(0,1),
                               labels = c("control","case")),
                 x1 = c(1,2,3,4,2,3),
                 x2 = c(4,8,6,0,2,4))
Now I want to make a matrix with the means of x1 and x2, but just for the patients with cancer (case). So a 2x1 matrix. How can I do this? I tried to start with those two functions, but I can't get them to work...
matrix <- filter(cancer$case == "case")
mean(cancer$x1 & cancer$case == "case")
I know this is a very simple example and I could do it by hand, but I'm new to R and want to know how I could handle this with 6000 instead of 6 rows.
Like this:
library(dplyr)
cancer %>%
  summarise(meanx1 = mean(x1[case == "case"]),
            meanx2 = mean(x2[case == "case"]))
# A tibble: 1 x 2
meanx1 meanx2
<dbl> <dbl>
1 2 6
If you want it as a matrix:
cancer %>%
  summarise(meanx1 = mean(x1[case == "case"]),
            meanx2 = mean(x2[case == "case"])) %>%
  as.matrix()
meanx1 meanx2
[1,] 2 6
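The question asked for a 2 x 1 matrix (traits as rows); one way to get that shape, assuming the pipeline above, is to transpose the result, with a base R equivalent shown for comparison (my own sketch):
cancer %>%
  summarise(meanx1 = mean(x1[case == "case"]),
            meanx2 = mean(x2[case == "case"])) %>%
  as.matrix() %>%
  t()
# base R: subset the cases, then take column means
as.matrix(colMeans(cancer[cancer$case == "case", c("x1", "x2")]))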
And with across if you have updated to dplyr 1.0.0:
cancer %>%
  summarise(across(x1:x2, ~ mean(.x[case == "case"]), .names = "mean_{col}")) %>%
  as.matrix()
mean_x1 mean_x2
[1,] 2 6
With the new dplyr 1.0.0, you can do
library(tidyverse)
cancer <- tibble(case = factor(c(rep(1,3), rep(0,3)),
                               levels = c(0,1),
                               labels = c("control","case")),
                 x1 = c(1,2,3,4,2,3),
                 x2 = c(4,8,6,0,2,4))
cancer %>%
  filter(case == "case") %>%
  summarize(across(x1:x2, mean))
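With 6000 rows and more trait columns, the same pattern scales without listing each column; a sketch assuming every column other than case is numeric:
cancer %>%
  filter(case == "case") %>%
  summarize(across(where(is.numeric), mean)) %>%
  as.matrix()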
I have 24 datasets that each have one factor and one response. I have written code to subset the 93 entries into 3 categories, but I'm not sure of the most efficient way to run this code for all 24 of my datasets. Any ideas would be much appreciated.
Here's the data I'm working with.
dput(head(data))
structure(list(run.size.percentage = structure(c(2L, 13L, 24L,
35L, 46L, 57L), .Label = c(",2000,", "1,0.375,0.013", "10,0.868,0.11",
"11,0.953,0.12", "12,1.047,0.12", "13,1.149,0.13", "14,1.261,0.14",
"15,1.385,0.14", "16,1.520,0.15", "17,1.668,0.15", "18,1.832,0.16",
"19,2.011,0.17", "2,0.412,0.023", "20,2.207,0.17", "21,2.423,0.18",
"22,2.660,0.19", "23,2.920,0.20", "24,3.205,0.21", "25,3.519,0.22",
"26,3.863,0.24", "27,4.240,0.25", "28,4.655,0.26", "29,5.110,0.28",
"3,0.452,0.034", "30,5.610,0.30", "31,6.158,0.31", "32,6.760,0.33",
"33,7.421,0.35", "34,8.147,0.37", "35,8.943,0.39", "36,9.817,0.42",
"37,10.78,0.45", "38,11.83,0.47", "39,12.99,0.50", "4,0.496,0.049",
"40,14.26,0.53", "41,15.65,0.56", "42,17.18,0.58", "43,18.86,0.59",
"44,20.70,0.59", "45,22.73,0.58", "46,24.95,0.55", "47,27.39,0.52",
"48,30.07,0.49", "49,33.01,0.46", "5,0.545,0.061", "50,36.24,0.45",
"51,39.78,0.45", "52,43.67,0.45", "53,47.94,0.44", "54,52.62,0.42",
"55,57.77,0.38", "56,63.41,0.35", "57,69.61,0.32", "58,76.42,0.31",
"59,83.89,0.33", "6,0.598,0.072", "60,92.09,0.36", "61,101.1,0.42",
"62,111.0,0.49", "63,121.8,0.59", "64,133.7,0.74", "65,146.8,0.94",
"66,161.2,1.19", "67,176.9,1.49", "68,194.2,1.82", "69,213.2,2.18",
"7,0.656,0.083", "70,234.1,2.55", "71,256.9,2.94", "72,282.1,3.34",
"73,309.6,3.78", "74,339.9,4.25", "75,373.1,4.73", "76,409.6,5.20",
"77,449.7,5.60", "78,493.6,5.87", "79,541.9,5.93", "8,0.721,0.093",
"80,594.9,5.77", "81,653.0,5.37", "82,716.8,4.77", "83,786.9,4.03",
"84,863.9,3.21", "85,948.3,2.36", "86,1041,1.55", "87,1143,0.81",
"88,1255,0.30", "89,1377,0.056", "9,0.791,0.10", "90,1512,0.0044",
"91,1660,0", "92,1822,0"), class = "factor")), row.names = c(NA,
6L), class = "data.frame")
Here's the code that worked for each dataset.
data2 <- tidyr::separate(names(data), unlist(strsplit(names(data), "\\.")), ",", data=data)
group1 <- data2 %>% filter(size <= 2)
group2 <- data2 %>% filter(size > 2 & size <= 50)
group3 <- data2 %>% filter(size > 50 & size <= 2000)
sum(as.numeric(group1$percentage), na.rm=TRUE)
sum(as.numeric(group2$percentage), na.rm=TRUE)
sum(as.numeric(group3$percentage), na.rm=TRUE)
Put your dataframes in a list and use lapply. I used cut to create the needed size groups, and added the convert = TRUE argument to separate so the numbers are converted to numeric:
df_list <- list(df, df) # creating a dummy list with same df
lapply(df_list, function(x) {
  separate(names(x), unlist(strsplit(names(x), "\\.")), ",",
           data = x, convert = TRUE) %>%
    group_by(group = cut(size, breaks = c(0, 2, 50, 2000, Inf))) %>%
    summarise(percentage = sum(percentage))
})
# every list element is your desired output df
[[1]]
# A tibble: 1 x 2
group percentage
<fct> <dbl>
1 (0,2] 0.252
[[2]]
# A tibble: 1 x 2
group percentage
<fct> <dbl>
1 (0,2] 0.252
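To build df_list for the real 24 datasets (instead of the dummy list above), two possible ways, with hypothetical object and file names:
# if the datasets already exist as objects named data1 ... data24 (hypothetical names)
df_list <- mget(paste0("data", 1:24))
# or, if each dataset is a CSV file in one folder (hypothetical path)
files <- list.files("path/to/runs", pattern = "\\.csv$", full.names = TRUE)
df_list <- lapply(files, read.csv)
names(df_list) <- basename(files)  # keeps each result labelled by its source file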
I'm quite new to R and I'm facing a problem which I guess is quite easy to fix but I couldn't find the answer.
I have a dataframe called clg where basically I have 3 columns date, X1, X2.
X1 and X2 are names of country teams. X1 and X2 contain the same list of countries.
I'm simply trying to count the frequency of each country in the two columns as a total.
So far, I've only been able to count the frequency of the X1 column but I didn't find a way to sum both columns.
clt <- as_tibble(na.omit(count(clg, clg$X1)))
I would like to get a data frame where the first column contains the unique countries, and the second column the sum of occurrences in X1 + X2.
You can use unlist() and table() to get the overall counts. Wrapping the result in data.frame() gives you the desired two-column output.
clg <- data.frame(date = 1:3,
                  X1 = c("nor", "swe", "alg"),
                  X2 = c("swe", "alg", "jpn"))
data.frame(table(unlist(clg[c("X1", "X2")])))
# Var1 Freq
# 1 alg 2
# 2 nor 1
# 3 swe 2
# 4 jpn 1
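If you prefer more descriptive column names than table()'s defaults, a small follow-up sketch:
counts <- data.frame(table(unlist(clg[c("X1", "X2")])))
names(counts) <- c("country", "n")
counts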
With tidyverse, we can gather into 'long' format and then do the count
library(tidyverse)
gather(clg, key, Var1, -date) %>%
count(Var1)
# A tibble: 4 x 2
# Var1 n
# <chr> <int>
#1 alg 2
#2 jpn 1
#3 nor 1
#4 swe 2
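gather() is superseded in current tidyr; an equivalent using pivot_longer() (my own sketch, assuming tidyr >= 1.0.0) would be:
library(tidyr)
library(dplyr)
clg %>%
  pivot_longer(-date, values_to = "Var1") %>%
  count(Var1)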
data
clg <- structure(list(date = 1:3,
                      X1 = structure(c(2L, 3L, 1L), .Label = c("alg", "nor", "swe"), class = "factor"),
                      X2 = structure(c(3L, 1L, 2L), .Label = c("alg", "jpn", "swe"), class = "factor")),
                 class = "data.frame", row.names = c(NA, -3L))
You can achieve this in two steps. First, count the occurrences of each country in each column separately. Then join the two data frames and calculate the total sum.
X1_sum <- clg %>%
  dplyr::group_by(X1) %>%
  dplyr::summarize(n_x1 = n())

X2_sum <- clg %>%
  dplyr::group_by(X2) %>%
  dplyr::summarize(n_x2 = n())

final_summary <- X1_sum %>%
  # merge the two counts by country name
  dplyr::left_join(., X2_sum, by = c("X1" = "X2")) %>%
  dplyr::mutate(n_sum = n_x1 + n_x2)
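Note that a left join keeps only the countries that appear in X1 and leaves NA counts for countries missing from X2; if both directions matter, a full join with coalesce is one option (my own sketch, assuming dplyr is loaded):
final_summary <- X1_sum %>%
  dplyr::full_join(X2_sum, by = c("X1" = "X2")) %>%
  dplyr::mutate(n_sum = dplyr::coalesce(n_x1, 0L) + dplyr::coalesce(n_x2, 0L))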
I have data for which I want to summarize group means. I then would like to re-group some of the smaller groups (matching a certain n < x condition) into a group called "others". I found a way to do this. But it feels like there are more efficient solutions out there. I wonder how a data.table approach would solve the problem.
Here is an example using tibble and dplyr.
# preps
library(tibble)
library(dplyr)
set.seed(7)
# generate 4 groups with more observations
tbl_1 <- tibble(group = rep(sample(letters[1:4], 150, TRUE), each = 4),
                score = sample(0:10, size = 600, replace = TRUE))
# generate 3 groups with less observations
tbl_2 <- tibble(group = rep(sample(letters[5:7], 50, TRUE), each = 3),
                score = sample(0:10, size = 150, replace = TRUE))
# put them into one data frame
tbl <- rbind(tbl_1, tbl_2)
# aggregate the mean scores and count the observations for each group
tbl_agg1 <- tbl %>%
  group_by(group) %>%
  summarize(MeanScore = mean(score),
            n = n())
So far so easy.
Next I want to only show groups with more than 100 observations. All other groups should be merged into one group called "others".
# First, calculate summary stats for groups with n < 100
tbl_agg2 <- tbl_agg1 %>%
  filter(n < 100) %>%
  summarize(MeanScore = weighted.mean(MeanScore, n),
            sumN = sum(n))
Note: There was a mistake in the calculation above, which is now corrected (@Frank: thanks for spotting it!)
# Second, remove groups with n < 100 from the aggregate table and instead add a row containing the summary statistics calculated above
tbl_agg1 <- tbl_agg1 %>%
  filter(n > 100) %>%
  add_row(group = "others", MeanScore = tbl_agg2[["MeanScore"]], n = tbl_agg2[["sumN"]])
tbl_agg1 basically shows what I want it to show, but I wonder if there is a smoother, more efficient way to do this. At the same time I wonder how a data.table approach would deal with the problem at hand.
I welcome any suggestions.
Your calculation for the "other" group is wrong, I guess... should be...
tbl_agg1 %>% {bind_rows(
  filter(., n > 100),
  filter(., n < 100) %>%
    summarize(group = "other", MeanScore = weighted.mean(MeanScore, n), n = sum(n))
)}
However, you could keep things a lot simpler from the start by using a different grouping variable:
tbl %>%
  group_by(group) %>%
  group_by(g = replace(group, n() < 100, "other")) %>%
  summarise(n = n(), m = mean(score))
# A tibble: 5 x 3
g n m
<chr> <int> <dbl>
1 a 136 4.79
2 b 188 4.49
3 c 160 5.32
4 d 116 4.78
5 other 150 5.42
Or with data.table
library(data.table)
DT = data.table(tbl)
DT[, n := .N, by=group]
DT[, .(.N, m = mean(score)), keyby=.(g = replace(group, n < 100, "other"))]
g N m
1: a 136 4.786765
2: b 188 4.489362
3: c 160 5.325000
4: d 116 4.784483
5: other 150 5.420000
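For completeness, the same lumping of small groups into "other" can also be written with forcats::fct_lump_min (my addition, not part of the answer above; assumes a recent forcats is installed):
library(dplyr)
library(forcats)
tbl %>%
  group_by(g = fct_lump_min(group, min = 100, other_level = "other")) %>%
  summarise(n = n(), m = mean(score))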