I have a tibble with several columns, including an ID column and a "score" column. The ID column has some duplicated values. I want to create a tibble that has one row per unique ID, and the same number of columns as the original tibble. For any ID, the "score" value in this new tibble should be the mean of the scores for the ID in the original tibble. And for any ID, the value for the other columns should be the first value that appears for that ID in the original tibble.
When the number of columns in the original tibble is small and known, this is an easy problem. Example:
library(dplyr)

scores <- tibble(
  ID = c(1, 1, 2, 2, 3),
  score = 1:5,
  a = 6:10
)

scores %>%
  group_by(ID) %>%
  summarize(score = mean(score), a = first(a))
But I often work with tibbles (or data frames) that have dozens of columns. I don't know in advance how many columns there will be or how they will be named. In these cases, I still want a function that takes, within each group, the mean of the score column and the first value of the other columns. But it isn't practical to spell out the name of each column. Is there a generic command that will let me summarize() by taking the mean of one column and the first value of all of the others?
A two-step solution would start by using mutate() to replace each score within a group with the mean of those scores. Then I could create my desired tibble by taking the first row of each group. But is there a one-step solution, perhaps using one of the select_helpers in dplyr?
Summarizing unknown number of column in R using dplyr is the closest post that I've been able to find. But I can't see that it quite speaks to this problem.
You can use mutate to get the mean values and then use slice to get the first row of each group, i.e.
library(dplyr)
scores %>%
  group_by(ID) %>%
  mutate(score = mean(score)) %>%
  slice(1L)
#Source: local data frame [3 x 3]
#Groups: ID [3]
# ID score a
# <dbl> <dbl> <int>
#1 1 1.5 6
#2 2 3.5 8
#3 3 5.0 10
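The answer above is the two-step mutate() + slice() route; for the one-step summarize() the question asks about, here is a sketch assuming dplyr 1.0 or later (for across()): take the first value of every non-grouping column except score, and the mean of score itself.
library(dplyr)
scores %>%
  group_by(ID) %>%
  summarize(across(-score, first),
            score = mean(score),
            .groups = "drop")
This keeps one row per ID and the same set of columns, whatever they are named (score simply ends up as the last column rather than in its original position).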
In R, I have a data frame and a vector. The data frame has a column of dates (e.g. column A). The vector also contains dates. The dates are not necessarily continuous (e.g. a few consecutive dates may be 1/4/23, 1/17/23, 2/4/23, etc.) for either column A or the vector.
I want to create a new column in the data frame (column B) which is equal to, for each row, the minimum value of the vector that is greater than the date in column A. Put more generally, I want to create a new data frame column based on comparing an existing column to a vector.
I have figured out how to do this using a function/loop, but it is not the cleanest. Is there a simpler way to do this without a loop? A dplyr solution would be ideal, as that is what I mostly use elsewhere in my code, but any help would be much appreciated. It would also be helpful to know if this is not possible without a loop. Thanks!
Using a rowwise mutate in dplyr, subset the vector to elements >= your date column, sort, and take the first element:
library(dplyr)
# example data
dat <- data.frame(
  columnA = as.Date(c("2023-01-04", "2023-01-17", "2023-02-04"))
)
vec <- as.Date(c("2023-01-01", "2023-03-01", "2023-01-04", "2023-01-30"))

dat %>%
  rowwise() %>%
  mutate(columnB = first(sort(vec[vec >= columnA]))) %>%
  ungroup()
# A tibble: 3 × 2
columnA columnB
<date> <date>
1 2023-01-04 2023-01-04
2 2023-01-17 2023-01-30
3 2023-02-04 2023-03-01
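If the rowwise() pass is slow on a large data frame, a vectorized sketch (my addition, not part of the answer above) is to sort the vector once and use findInterval(): with left.open = TRUE the returned index is the count of elements strictly less than each date, so the next element is the smallest one on or after columnA. It assumes every columnA value has at least one vec element on or after it; otherwise the result is NA.
library(dplyr)
vec_sorted <- sort(vec)
dat %>%
  mutate(columnB = vec_sorted[findInterval(columnA, vec_sorted, left.open = TRUE) + 1])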
I am doing some manipulations with dplyr. I am working with the brca data set.
I have to find a solution for the below question.
" We are interested what variable might be the best indicator for the outcome
malignant ("M") or benign ("B"). There are 30 features (variables) and we
want to select one variable that has the largest difference between means
for groups M and B."
Now I want to find the difference between the two resulting rows, then find the maximum difference and the corresponding column name.
Can anyone help me with this?
Thanks... :)
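The answer below works from a two-row tibble of per-group means called sumOutcome, which the question does not show. Here is a sketch of how it might be built, assuming brca is the dslabs version, where brca$x is the 30-column feature matrix and brca$y the M/B outcome factor:
library(dplyr)
library(dslabs)
data(brca)
# one row of feature means per outcome level ("B" and "M")
sumOutcome <- as.data.frame(brca$x) %>%
  mutate(outcome = brca$y) %>%
  group_by(outcome) %>%
  summarise(across(everything(), mean))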
To get the column name and the value with the highest absolute difference between the two rows, you can do:
library(dplyr)
library(tidyr)
sumOutcome %>%
  summarise(across(-outcome, diff)) %>%
  pivot_longer(cols = everything()) %>%
  slice(which.max(abs(value)))
# name value
# <chr> <dbl>
#1 concave_pts_worst 436.
I need to count how many cases there are per column value, and then average another column grouped by the same column value that I used with R's count() function. But when I use count(), all of the columns except the ones I grouped by disappear. Does anyone know how I can either attach the count() values to the original data frame according to the grouping column values, or count the cases per column value directly so that the rest of the data frame's columns (the ones not used to group the data) don't disappear? Thanks.
As mentioned above, you would have a much better chance of getting your desired output if you shared a piece of your data. However, if you want the result of the count to be added to your data set, use add_count() instead. I hope this example is what you had in mind:
library(dplyr)
df <- tribble(
  ~name,    ~gender,  ~runs,
  "Max",    "male",   10,
  "Sandra", "female", 1,
  "Susan",  "female", 4
)

df %>%
  add_count(gender) %>%
  group_by(gender) %>%
  mutate(avg_runs = mean(runs))
# A tibble: 3 x 5
# Groups: gender [2]
name gender runs n avg_runs
<chr> <chr> <dbl> <int> <dbl>
1 Max male 10 1 10
2 Sandra female 1 2 2.5
3 Susan female 4 2 2.5
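If you would rather collapse to one row per gender instead of adding the count back onto every row, the summarise() version of the same idea is:
df %>%
  group_by(gender) %>%
  summarise(n = n(), avg_runs = mean(runs))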
Assuming your data set is named df; for the output I am creating a data set called New.
You can use plyr for this (here column_name, "value1"/"value2", count1, and count2 are placeholders for your own column names and values):
library(plyr)
New <- ddply(df, .(column_name), summarize,
             Count1 = sum(column_name == "value1", na.rm = TRUE),
             Count2 = sum(column_name == "value2", na.rm = TRUE),
             mean1  = mean(count1, na.rm = TRUE),
             mean2  = mean(count2, na.rm = TRUE))
The picture shows the row number order.
I am trying to add a variable to my data set that represents the row number; however, every approach I've found adds the numbers in the order the rows currently appear (1, 2, 3, 4, 5), rather than in the order the View window shows (129, 98, 21, 09). I need the order shown in View, as I am trying to merge with another data set and need the correct ("original") row number.
I cannot add row numbers before making changes to the data set as the function doesn't work when I add the ID number.
Alternatively, being able to sort the data by row number would also help, but I don't know how to do that either (clicking on the arrow above the row number does nothing).
A bit of context
I am classifying network nodes in R. I made a matrix from the network's nodes and edges (using node2vec), and have to merge this matrix with a node labels data set (this data set contains one variable which shows whether nodes are positive or negative). The picture above shows the created matrix, and the original row numbers from the network data set are no longer in the original order. I need to add a variable to the matrix, which I converted to a data frame using:
netdf1 <- as.data.frame(network.node2vec)
that represents the original row number
What I tried
netdf1 <- netdf1 %>% mutate(id = row_number())
This just adds the row number as the rows are currently ordered so 1,2,3,4...
What worked in the end (correct answer):
db$ID <- rownames(db)
If I understand your question correctly, you have some kind of data frame with row names that are not continuous, and you now want those row names in an extra column as numeric values?
You can use the row.names() function and convert the result to numeric if you like:
# just creating a DF that might show what you mean:
testDF <- data.frame(x = 1:10, y = sample((1:1000), 10))
testDF <- testDF[testDF$y < 500,]
View(testDF)
# one possible way to get the row names
testDF$rowNum <- as.numeric(row.names(testDF))
And type ?sort in the console if you would like to learn about sorting vectors.
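For example, to reorder the data frame by the recovered row numbers, use order() on that column (sort() by itself only reorders a vector):
# put the rows back into their original order
testDF <- testDF[order(testDF$rowNum), ]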
Let's say you have a data frame with row names that are out of order:
my_data <- data.frame(row.names = 5:1,
                      V1 = 1:5)
#> my_data
# V1
#5 1
#4 2
#3 3
#2 4
#1 5
dplyr::row_number() will add row numbers based on the current sorting, not based on the row names. (A general practice in the tidyverse is to eschew keeping useful data in the row names and to instead incorporate any sorts of row ID info into a variable.)
So you could use #user2554330's advice and add my_data$ID <- row.names(my_data) or the tidyverse equivalent of my_data %>% tibble::rownames_to_column(var = "ID"), then sort by that column.
my_data %>%
  tibble::rownames_to_column(var = "ID") %>%
  arrange(ID)
ID V1
1 1 5
2 2 4
3 3 3
4 4 2
5 5 1
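One caveat (my addition): rownames_to_column() returns ID as character, so with multi-digit row names like the asker's 129, 98, 21, 9, arrange(ID) would sort lexicographically ("129" before "21"). Converting to numeric first avoids that:
library(dplyr)
my_data %>%
  tibble::rownames_to_column(var = "ID") %>%
  mutate(ID = as.numeric(ID)) %>%
  arrange(ID)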
(This question was closed as a duplicate of "Aggregate / summarize multiple variables per group (e.g. sum, mean)".)
I have a data.frame that looks somewhat like this.
k <- data.frame(id = c(1, 2, 2, 1, 2, 1, 2, 2, 1, 2),
                act = c('a', 'b', 'd', 'c', 'd', 'c', 'a', 'b', 'a', 'b'),
                var1 = 25:34, var2 = 74:83)
I have to group the data into separate levels by the first 2 columns and write the mean of the next 2 columns (var1 and var2). It should look like this:
  id act varmean1 varmean2
1  1   a
2  1   c
3  2   a
4  2   b
5  2   d
The values of the respective means are filled in varmean1 and varmean2.
My actual data frame has 88 columns, where I have to group the data by the first 2 columns and find the respective means of the remaining ones. Please help me figure this out as soon as possible, and please try to use the 'dplyr' package for the solution if possible. Thanks.
You have several options:
base R:
aggregate(. ~ id + act, k, mean)
or
aggregate(cbind(var1, var2) ~ id + act, k, mean)
The first option aggregates all the columns by id and act; the second option only the columns you specify. In this case both give the same result, but it is good to know for when you have more columns and only want to aggregate some of them.
dplyr:
library(dplyr)
k %>%
  group_by(id, act) %>%
  summarise_each(funs(mean))
If you want to specify the columns for which to calculate the mean, you can use summarise instead of summarise_each:
k %>%
  group_by(id, act) %>%
  summarise(var1mean = mean(var1), var2mean = mean(var2))
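Note that summarise_each() and funs() have since been deprecated; in current dplyr (1.0+) the equivalent uses across(), either over all remaining columns or over a named subset:
library(dplyr)
k %>%
  group_by(id, act) %>%
  summarise(across(everything(), mean), .groups = "drop")
k %>%
  group_by(id, act) %>%
  summarise(across(c(var1, var2), mean), .groups = "drop")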
data.table:
library(data.table)
setDT(k)[, lapply(.SD, mean), by = .(id, act)]
If you want to specify the columns for which to calculate the mean, you can add .SDcols like:
setDT(k)[, lapply(.SD, mean), by = .(id, act), .SDcols=c("var1", "var2")]