Group two variables (in rows) in R to create one variable [duplicate]

This question already has answers here:
How to merge multiple rows by a given condition and sum?
(2 answers)
Closed 2 years ago.
I have a data frame:
Disease      Genemutation  Mean.  Total.No.of.pateints  No.of.pateints.
cancertype1  BRCA1         1      10                    2
cancertype2  BRCA2         5      10                    3
cancertype3  BRCA2         7      10                    4
cancertype1  BRCA1         8      10                    1
cancertype3  BRCA2         4      10                    4
cancertype2  BRCA1         6      10                    1
How do I create a new category called cancertype4 (combining cancertype2 and cancertype3) whose number of patients is the sum over the rows being merged?

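We can use replace with %in% to recode those values (assuming 'Disease' is of class character), then group by the recoded column and sum:
library(dplyr)
# df1 is assumed to store the patient total in a column named TotalNoofpateints
df1 %>%
  group_by(Disease = replace(Disease,
    Disease %in% c("cancertype2", "cancertype3"), "cancertype4")) %>%
  summarise(TotalNoofpateints = sum(TotalNoofpateints))
Output: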
# A tibble: 2 x 2
# Disease TotalNoofpateints
# <chr> <int>
#1 cancertype1 20
#2 cancertype4 40
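To see what the recoding step does on its own, here is a minimal base R sketch. The vectors are copied from the question's data; the names disease, total, merged and sums are illustrative and do not appear in the original answer.

```r
# Disease labels and per-row totals copied from the question's data
disease <- c("cancertype1", "cancertype2", "cancertype3",
             "cancertype1", "cancertype3", "cancertype2")
total <- c(10L, 10L, 10L, 10L, 10L, 10L)

# %in% yields a logical mask; replace() overwrites the TRUE positions
merged <- replace(disease,
                  disease %in% c("cancertype2", "cancertype3"),
                  "cancertype4")

# Sum the totals within each recoded label
sums <- tapply(total, merged, sum)
print(sums)
```

Because the recoding happens before grouping, the two original categories fall into the same group and their rows are summed together.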

Here is a base R option using aggregate:
aggregate(
  Total.No.of.pateints ~ Disease,
  transform(
    df,
    Disease = replace(Disease, Disease %in% c("cancertype2", "cancertype3"), "cancertype4")
  ),
  sum
)
giving
      Disease Total.No.of.pateints
1 cancertype1                   20
2 cancertype4                   40
Data
> dput(df)
structure(list(Disease = c("cancertype1", "cancertype2", "cancertype3",
"cancertype1", "cancertype3", "cancertype2"), Genemutation = c("BRCA1",
"BRCA2", "BRCA2", "BRCA1", "BRCA2", "BRCA1"), Mean. = c(1L, 5L,
7L, 8L, 4L, 6L), Total.No.of.pateints = c(10L, 10L, 10L, 10L,
10L, 10L), No.of.pateints. = c(2L, 3L, 4L, 1L, 4L, 1L)), class = "data.frame", row.names = c(NA,
-6L))

Related

How to write a function in R that can make out the average of the 3 best scores out of 4

I am given a data frame with 10 students, each with a score on 4 different tests. For each student, I must select the 3 best scores and compute their average.
noma interro1 interro2 interro3 interro4
1 836016120449 6 3 NA 3
2 596844884419 1 4 2 8
3 803259953398 2 2 9 1
4 658786759629 3 1 3 2
5 571155022756 4 9 1 4
6 576037886365 8 7 8 7
7 045086625199 9 6 7 6
8 621909979467 5 8 4 5
9 457029205538 7 5 6 9
10 402526220817 NA 10 5 10
This dataframe provides the scores for 4 tests for 10 students.
Write a function that calculates the average score for the 3 best tests.
Calculate this average score for the 10 students.
average <- function(t){
  x <- sort(t, decreasing = TRUE)[1:3]
  return(mean(x, na.rm=TRUE))
}
apply(interro2, 1, average)
Since I want the 3 best scores, I thought sort() would be useful here. However, what I get is:
In mean.default(x, na.rm = TRUE) :
  argument is not numeric or logical: returning NA
I also tried this:
average <- function(t){
  rowMeans(sort(t, decreasing = TRUE, na.rm=TRUE)[1:3])
}
UPDATE: solved. The data frame passed to apply had the wrong columns: its first column holds the student IDs, which makes apply coerce every row to character. After dropping that column, the version below works:
average <- function(t){
  x <- sort(t, decreasing = TRUE)[1:3]
  return(mean(x, na.rm=TRUE))
}
apply(interro2[-1], 1, average)
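For reference, here is a self-contained sketch of that fix using the scores from the question. The names scores and top3_mean are illustrative, and head() is swapped in for [1:3] so that a row with fewer than three non-missing scores would not be padded with NA.

```r
scores <- data.frame(
  noma = c("836016120449", "596844884419", "803259953398", "658786759629",
           "571155022756", "576037886365", "045086625199", "621909979467",
           "457029205538", "402526220817"),
  interro1 = c(6L, 1L, 2L, 3L, 4L, 8L, 9L, 5L, 7L, NA),
  interro2 = c(3L, 4L, 2L, 1L, 9L, 7L, 6L, 8L, 5L, 10L),
  interro3 = c(NA, 2L, 9L, 3L, 1L, 8L, 7L, 4L, 6L, 5L),
  interro4 = c(3L, 8L, 1L, 2L, 4L, 7L, 6L, 5L, 9L, 10L),
  stringsAsFactors = FALSE
)

top3_mean <- function(x) {
  # sort() drops NA by default; head() keeps at most the 3 largest values
  mean(head(sort(x, decreasing = TRUE), 3))
}

# Drop the character ID column so apply() works on a numeric matrix
scores$top3 <- apply(scores[-1], 1, top3_mean)
print(scores[c("noma", "top3")])
```

The first student, for example, has scores 6, 3, NA and 3, so the three best are 6, 3 and 3 and the average is 4.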
Try pivoting the scores to long format, then sort them in descending order within each student and keep the top 3 scores. Finally, take the average grouped by noma:
library(dplyr)
library(tidyr)
data <- data.frame(
stringsAsFactors = FALSE,
noma = c("836016120449","596844884419",
"803259953398","658786759629","571155022756",
"576037886365","045086625199","621909979467","457029205538",
"402526220817"),
interro1 = c(6L, 1L, 2L, 3L, 4L, 8L, 9L, 5L, 7L, NA),
interro2 = c(3L, 4L, 2L, 1L, 9L, 7L, 6L, 8L, 5L, 10L),
interro3 = c(NA, 2L, 9L, 3L, 1L, 8L, 7L, 4L, 6L, 5L),
interro4 = c(3L, 8L, 1L, 2L, 4L, 7L, 6L, 5L, 9L, 10L)
)
# Long format: one row per student and test; a missing score is treated as 0
data <- data %>%
  pivot_longer(!noma, names_to = "interro", values_to = "value") %>%
  replace_na(list(value = 0))
data_new1 <- data[order(data$noma, data$value, decreasing = TRUE), ] # Order data descending
data_new1 <- Reduce(rbind, by(data_new1, data_new1["noma"], head, n = 3)) # Top N highest values by group
data_new1 <- data_new1 %>% group_by(noma) %>% summarise(Value_mean = mean(value))

How to get the difference between groups with a dataframe in long format in R?

I have a simple data frame with 2 IDs (N = 2) and 2 periods (T = 2), for example:
year id points
1 1 10
1 2 12
2 1 20
2 2 18
How does one obtain the following data frame (preferably using dplyr or any tidyverse solution)?
id points_difference
1 10
2 6
Notice that the points_difference column is the change in points for each ID across time (namely T2 - T1).
Additionally, how can this be generalized to multiple columns and multiple IDs (still with only 2 periods)?
year id points scores
1 1 10 7
1 ... ... ...
1 N 12 8
2 1 20 9
2 ... ... ...
2 N 12 9
id points_difference scores_difference
1 10 2
... ... ...
N 0 1
If you are on dplyr 1.0.0 (or higher), summarise can return multiple rows per group, so this also works if you have more than 2 periods. (From dplyr 1.1.0 on, returning multiple rows from summarise is deprecated in favour of reframe.) You can do:
library(dplyr)
df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  summarise(across(c(points, scores), diff, .names = '{col}_difference'))
# id points_difference scores_difference
# <int> <int> <int>
#1 1 10 2
#2 1 -7 1
#3 2 6 2
#4 2 -3 3
data
df <- structure(list(year = c(1L, 1L, 2L, 2L, 3L, 3L), id = c(1L, 2L,
1L, 2L, 1L, 2L), points = c(10L, 12L, 20L, 18L, 13L, 15L), scores = c(2L,
3L, 4L, 5L, 5L, 8L)), class = "data.frame", row.names = c(NA, -6L))
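For the two-period case in the question, the same idea can be sketched in base R. This is only an illustration, not part of the answer above: the data frame df2 is the question's first example, and the names df2 and res are hypothetical.

```r
# Two periods, two IDs, as in the question's first example
df2 <- data.frame(
  year   = c(1L, 1L, 2L, 2L),
  id     = c(1L, 2L, 1L, 2L),
  points = c(10L, 12L, 20L, 18L)
)

# Order by id and year so diff() computes T2 - T1 within each id
df2 <- df2[order(df2$id, df2$year), ]

# With exactly two rows per id, diff() returns a single value per group
res <- aggregate(points ~ id, data = df2, FUN = diff)
names(res)[names(res) == "points"] <- "points_difference"
print(res)
```

Sorting first matters: diff() subtracts consecutive elements, so the sign of the result depends on the rows being in period order.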

Summarize by column: mean and sum

I am trying to summarize a list of variables by group. Some variables need to be summed and others need to be averaged.
I have this:
Group Variable1 Variable2
1 10 2
1 12 6
2 6 7
2 4 9
I'd like the sum of variable 1 and mean of variable 2:
Group Variable1 Variable2
1 22 4
2 10 8
I've been using dplyr to get the group sum:
sum <- (df %>%
group_by(Group) %>%
summarise_all(funs(sum)))
I'm trying to find a way to choose which columns are summed and which are averaged for the summarize function.
Thank you!
With dplyr 1.0.0 or later (the development version when this was written), it is possible to selectively apply different functions to different sets of variables with across:
library(dplyr)
df %>%
  group_by(Group) %>%
  summarise(across(Variable1:Variable2, sum), across(Variable3:Variable5, mean))
# A tibble: 2 x 6
# Group Variable1 Variable2 Variable3 Variable4 Variable5
# <int> <int> <int> <dbl> <dbl> <dbl>
#1 1 22 8 18.5 5 24
#2 2 10 16 11 7 20.5
data
df <- structure(list(Group = c(1L, 1L, 2L, 2L), Variable1 = c(10L,
12L, 6L, 4L), Variable2 = c(2L, 6L, 7L, 9L), Variable3 = c(24L,
13L, 10L, 12L), Variable4 = c(3L, 7L, 9L, 5L), Variable5 = c(26L,
22L, 23L, 18L)), class = "data.frame", row.names = c(NA, -4L))
Example data with more columns:
df <- structure(list(Group = c(1L, 1L, 2L, 2L), Variable1 = c(10L,
12L, 6L, 4L), Variable2 = c(2L, 6L, 7L, 9L), Variable3 = c(9L,
8L, 10L, 2L), Variable4 = c(8L, 7L, 9L, 5L)), row.names = c(NA,
-4L), class = "data.frame")
# Group Variable1 Variable2 Variable3 Variable4
# 1: 1 10 2 9 8
# 2: 1 12 6 8 7
# 3: 2 6 7 10 9
# 4: 2 4 9 2 5
Create vectors of variable names and use mget + lapply in data.table
library(data.table)
setDT(df)
df[, c(lapply(mget(paste0('Variable', 1:2)), sum),
lapply(mget(paste0('Variable', 3:4)), mean)),
by = Group]
# Group Variable1 Variable2 Variable3 Variable4
# 1: 1 22 8 8.5 7.5
# 2: 2 10 16 6.0 7.0
Here is a base R solution using merge + aggregate:
dfout <- merge(aggregate(Variable1~Group,df,sum),
aggregate(Variable2~Group,df,mean))
such that
> dfout
Group Variable1 Variable2
1 1 22 4
2 2 10 8
DATA
df <- structure(list(Group = c(1L, 1L, 2L, 2L), Variable1 = c(10L,
12L, 6L, 4L), Variable2 = c(2L, 6L, 7L, 9L)), class = "data.frame", row.names = c(NA,
-4L))
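If there are many columns, one call to aggregate per column gets repetitive. A possible base R generalisation, sketched here under the assumption that a named list maps each column to its summary function (the names funs_by_col, parts and result are illustrative):

```r
df <- data.frame(
  Group     = c(1L, 1L, 2L, 2L),
  Variable1 = c(10L, 12L, 6L, 4L),
  Variable2 = c(2L, 6L, 7L, 9L)
)

# Which function to apply to which column
funs_by_col <- list(Variable1 = sum, Variable2 = mean)

# One tapply() per column, each with its own function
parts <- Map(function(col, f) tapply(df[[col]], df$Group, f),
             names(funs_by_col), funs_by_col)

# Assemble the per-column results into one data frame
result <- data.frame(
  Group = as.integer(names(parts[[1]])),
  lapply(parts, as.vector)
)
print(result)
```

Adding another column then only requires a new entry in funs_by_col.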
We can use mutate_at to apply functions to multiple columns and then select 1st row in each group to get summarised values.
library(dplyr)
df %>%
group_by(Group) %>%
mutate_at(vars(Variable1:Variable2), sum) %>%
mutate_at(vars(Variable3:Variable4), mean) %>%
slice(1L)
# Group Variable1 Variable2 Variable3 Variable4
# <int> <int> <int> <dbl> <dbl>
#1 1 22 8 8.5 7.5
#2 2 10 16 6 7

Reducing multiple rows to 1 by index in R [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 6 years ago.
I am relatively new to R. I am working with a dataset that has multiple data points per timestamp, but they are spread over multiple rows. I am trying to make a single row for each timestamp with a column for each variable.
Example dataset
Time Variable Value
10 Speed 10
10 Acc -2
10 Energy 10
15 Speed 9
15 Acc -1
20 Speed 9
20 Acc 0
20 Energy 2
I'd like to convert this to
Time Speed Acc Energy
10 10 -2 10
15 9 -1 (blank or N/A)
20 9 0 2
These are measured values so they are not always complete.
I have tried ddply to extract each individual value into an array and recombine, but the columns are different lengths. I have tried aggregate, but I can't figure out how to keep the variable and value linked. I know I could do this with a for loop type solution, but that seems a poor way to do this in R. Any advice or direction would help. Thanks!
I assume data.frame's name is df
library(tidyr)
spread(df,Variable,Value)
Typically a job for dcast in reshape2. First, we make your example reproducible:
df <- structure(list(Time = c(10L, 10L, 10L, 15L, 15L, 20L, 20L, 20L),
Variable = structure(c(3L, 1L, 2L, 3L, 1L, 3L, 1L, 2L), .Label = c("Acc",
"Energy", "Speed"), class = "factor"), Value = c(10L, -2L, 10L,
9L, -1L, 9L, 0L, 2L)), .Names = c("Time", "Variable", "Value"),
class = "data.frame", row.names = c(NA, -8L))
Then:
library(reshape2)
dcast(df, Time ~ ...)
Time Acc Energy Speed
10 -2 10 10
15 -1 NA 9
20 0 2 9
With dplyr you can reorder the columns (purely cosmetic) with:
library(dplyr)
dcast(df, Time ~ ...) %>% select(Time, Speed, Acc, Energy)
Time Speed Acc Energy
10 10 -2 10
15 9 -1 NA
20 9 0 2
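The same long-to-wide conversion can also be sketched without extra packages, using reshape from base R's stats package. This is an alternative to the answers above, not taken from them; the names long and wide are illustrative.

```r
long <- data.frame(
  Time     = c(10L, 10L, 10L, 15L, 15L, 20L, 20L, 20L),
  Variable = c("Speed", "Acc", "Energy", "Speed", "Acc",
               "Speed", "Acc", "Energy"),
  Value    = c(10L, -2L, 10L, 9L, -1L, 9L, 0L, 2L),
  stringsAsFactors = FALSE
)

# One row per Time, one column per distinct Variable
wide <- reshape(long, idvar = "Time", timevar = "Variable",
                direction = "wide")

# reshape() names the new columns Value.Speed, Value.Acc, ...; strip the prefix
names(wide) <- sub("^Value\\.", "", names(wide))
rownames(wide) <- NULL
print(wide)
```

Combinations that are missing in the long data, such as Energy at Time 15, come out as NA.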

Subset first n occurrences of certain value in dataframe

Suppose I have a matrix (or dataframe):
1 5 8
3 4 9
3 9 6
6 9 3
3 1 2
4 7 2
3 8 6
3 2 7
I would like to select only the first three rows that have "3" as their first entry, as follows:
3 4 9
3 9 6
3 1 2
It is clear to me how to pull out all rows that begin with "3" and it is clear how to pull out just the first row that begins with "3."
But in general, how can I extract the first n rows that begin with "3"?
Furthermore, how can I select just the 3rd and 4th appearances, as follows:
3 1 2
3 8 6
Without the need for an extra package:
mydf[mydf$V1==3,][1:3,]
results in:
V1 V2 V3
2 3 4 9
3 3 9 6
5 3 1 2
When you need the third and fourth row:
mydf[mydf$V1==3,][3:4,]
# or:
mydf[mydf$V1==3,][c(3,4),]
Used data:
mydf <- structure(list(V1 = c(1L, 3L, 3L, 6L, 3L, 4L, 3L, 3L),
V2 = c(5L, 4L, 9L, 9L, 1L, 7L, 8L, 2L),
V3 = c(8L, 9L, 6L, 3L, 2L, 2L, 6L, 7L)),
.Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA, -8L))
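A variant of the indexing above, sketched with which() so the matching row positions are computed once and then sliced. The names idx, first_n and third_fourth are illustrative.

```r
mydf <- data.frame(
  V1 = c(1L, 3L, 3L, 6L, 3L, 4L, 3L, 3L),
  V2 = c(5L, 4L, 9L, 9L, 1L, 7L, 8L, 2L),
  V3 = c(8L, 9L, 6L, 3L, 2L, 2L, 6L, 7L)
)

# Row positions of every row whose first entry is 3
idx <- which(mydf$V1 == 3)

# First n matches; head() returns fewer rows if there are fewer than n matches
first_n <- mydf[head(idx, 3), ]

# Third and fourth matches only
third_fourth <- mydf[idx[3:4], ]

print(first_n)
print(third_fourth)
```

Note that indexing with [1:3, ] on a subset that has fewer than three rows pads the result with NA rows, which head() avoids.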
Bonus material: besides dplyr, you can also do this very efficiently with data.table (see this answer for speed comparisons on large datasets for the different data.table methods):
library(data.table)
setDT(mydf)[V1==3, head(.SD,3)]
# or:
setDT(mydf)[V1==3, .SD[1:3]]
With dplyr, you can extract the first three rows for each unique value of a column like this:
library(dplyr)
df %>% arrange(columnName) %>% group_by(columnName) %>% slice(1:3)
If you only want the first three rows where that column equals 3, you can try:
df %>% filter(columnName == 3) %>% slice(1:3)
If you want specific rows, you can pass their positions to slice, for example c(3, 4).
We could also use subset
head(subset(mydf, V1==3),3)
Update
If we need to extract also one row below the rows where V1==3,
i1 <- with(mydf, V1==3)
mydf[sort(unique(c(which(i1),pmin(which(i1)+1L, nrow(mydf))))),]
