I have to perform some simple operations on a few vectors and rows.
Assume that I have a dataset such as:
observation outcome_1_a outcome_2_a outcome_1_b outcome_2_b choice_a choice_b
          1          41          34          56          19        1        1
          2          32          78          43           6        2        1
          3          39          19          18          55        1        2
For each observation, outcome_1 and outcome_2 are the two possible outcomes, choice is the outcome chosen, and the suffix _i, with i = a, b, identifies the repetition of the situation.
If I want to create variables storing the highest outcome for each situation (a, b), I can do:
max.a <- pmax(data$outcome_1_a, data$outcome_2_a)
max.b <- pmax(data$outcome_1_b, data$outcome_2_b)
Similarly, if I want to create variables storing the values chosen in each situation, I can do:
choice.a <- ifelse(data$choice_a == 1, data$outcome_1_a, data$outcome_2_a)
choice.b <- ifelse(data$choice_b == 1, data$outcome_1_b, data$outcome_2_b)
Finally, if I'd like to compute the row-wise mean for situations a and b, I can do:
library(data.table)
setDT(data)
data[, .(Mean = rowMeans(.SD)), by = observation, .SDcols = c("outcome_1_a","outcome_2_a", "outcome_1_b", "outcome_2_b")]
Now, all of these work just fine. However, I was wondering whether such operations can be done in a more efficient way.
In the example there are only a few situations but, if in the future I have to deal with, say, 15 or more different situations (a, b, c, d, ...), writing such operations by hand might be tedious.
Is there a way to automate such process based on the different prefixes and/or suffixes of the variables?
Thank you for your help
You can select columns with a regex. For example, to get your max.a values:
library(data.table)
setDT(data)
data[, do.call(pmax, .SD), .SDcols = names(data) %like% "\\d+_a$"]
[1] 41 78 39
Alternatively, you could select your columns with some regex outside of the data.table. Lots of ways to go about this.
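For instance, the same selection can be generalised over every suffix at once. A base-R sketch (sample values taken from the question; the regex assumes single-letter suffixes):

```r
# Sample data mirroring the question
data <- data.frame(outcome_1_a = c(41, 32, 39), outcome_2_a = c(34, 78, 19),
                   outcome_1_b = c(56, 43, 18), outcome_2_b = c(19, 6, 55))

# Suffix letters actually present among the outcome columns
sufs <- unique(sub(".*_([a-z])$", "\\1", grep("^outcome", names(data), value = TRUE)))

# One max.<suffix> vector per situation, via pmax over the matching columns
maxes <- setNames(
  lapply(sufs, function(s) do.call(pmax, data[grep(paste0("_", s, "$"), names(data))])),
  paste0("max.", sufs))
```

maxes$max.a is then c(41, 78, 39), matching the output above, and the loop scales to however many letters appear in the names.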
A similar application to your last command:
data[,
.(Mean = rowMeans(.SD)),
by = observation,
.SDcols = names(data) %like% "^outcome"]
observation Mean
1: 1 37.50
2: 2 39.75
3: 3 32.75
For choice.a, how would you choose between b, c, d, e etc?
For instance:
outcome_1_a outcome_2_a outcome_1_b outcome_2_b outcome_1_c outcome_2_c outcome_1_d outcome_2_d outcome_1_e outcome_2_e choice_a choice_b choice_c choice_d choice_e
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 85 32 28 91 42 32 96 27 29 2 1 1 1 1
2 17 22 84 53 11 69 16 66 11 41 1 2 2 1 1
3 92 98 76 83 18 27 21 51 92 41 1 1 1 1 2
4 63 49 61 64 100 28 43 51 22 94 1 2 1 1 1
Define an index variable that will help you go through the loops:
seqmax <- seq(1, 10, by = 2)
seqmax is 1 3 5 7 9. The reason is that there are 5 letters, "a" through "e", so this sequence patterns the loop. This can be automated for any number of letters: find the column index of the last column before choice_a, then do seq(1, grep(names(data), pattern = "choice_a") - 1, by = 2). The by = 2 argument can be adjusted for the number of columns per letter.
I use lapply with <<- to assign the new columns to data.
lapply(1:5, function(x) {
  data[, paste0("max.", letters[x])] <<- apply(data[, c(seqmax[x], seqmax[x] + 1)], 1, max)
  data[, paste0("choice.", letters[x])] <<- ifelse(
    data[, grep(names(data), pattern = paste0("choice_", letters[x]), value = TRUE)] == 1,
    data[, seqmax[x]], data[, seqmax[x] + 1])
  data[, paste0("mean.", letters[x])] <<- rowMeans(
    data[, grep(names(data), pattern = paste0("outcome_\\d+_", letters[x]), value = TRUE)])
})
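If you'd rather avoid <<-, the same logic fits a plain for loop that assigns into data directly. A sketch with two situations and made-up values (the grep pattern assumes the question's naming scheme):

```r
# Two-situation stand-in for the question's data
data <- data.frame(outcome_1_a = c(12, 17), outcome_2_a = c(85, 22),
                   outcome_1_b = c(32, 84), outcome_2_b = c(28, 53),
                   choice_a = c(2, 1), choice_b = c(1, 2))

for (let in c("a", "b")) {
  cols <- grep(paste0("^outcome_[0-9]+_", let, "$"), names(data), value = TRUE)
  # max, chosen value, and row mean per situation, all derived from the names
  data[[paste0("max.", let)]]    <- do.call(pmax, data[cols])
  data[[paste0("choice.", let)]] <- ifelse(data[[paste0("choice_", let)]] == 1,
                                           data[[cols[1]]], data[[cols[2]]])
  data[[paste0("mean.", let)]]   <- rowMeans(data[cols])
}
```

Looping over the letters rather than over seqmax indices also sidesteps hard-coding the column positions.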
I must be missing something, but I don't understand why my map function is not working.
I have a small data set to which I want to append a new column: the sum of two existing columns.
DT <- data.table("columnPred" = c(1,2,3),
"column1" = c(7,8,9),
"column2" = c(44,55,66),
"new_column1" = rep(NA, 3))
I wrote my function to sum up:
test_map <- function(x){
x$new_column1 = x$column1 + x$column2
}
and run map:
map(DT, test_map)
I keep getting errors. What is wrong with my map function? How can I use map to apply the same function row-wise? Is there a better alternative?
We do not need map for that:
library(data.table)
DT[,new_column1 := (column1 + column2)][]
#> columnPred column1 column2 new_column1
#> 1: 1 7 44 51
#> 2: 2 8 55 63
#> 3: 3 9 66 75
However, if we want a map function to get the sum of the two columns, we can do the following:
library(data.table)
library(purrr)
pmap_dbl(DT, ~ ..2 + ..3)
#> [1] 51 63 75
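pmap() also matches columns by name, so the positional ..2 + ..3 can be written more readably (a sketch; the ... soaks up the unused columnPred column):

```r
library(purrr)

DT <- data.frame(columnPred = c(1, 2, 3),
                 column1    = c(7, 8, 9),
                 column2    = c(44, 55, 66))

# Each row is passed as a named list; unused columns fall into ...
pmap_dbl(DT, function(column1, column2, ...) column1 + column2)
#> [1] 51 63 75
```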
We could use rowSums(), which would also take care of NA elements if present:
library(data.table)
DT[, new_column1 := rowSums(.SD, na.rm = TRUE), .SDcols = column1:column2]
Output:
> DT
columnPred column1 column2 new_column1
<num> <num> <num> <num>
1: 1 7 44 51
2: 2 8 55 63
3: 3 9 66 75
I've got a data.frame dt with some duplicate keys and missing data, i.e.
Name Height Weight Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA
In this case the key is the name, and I would like to apply to each column a function like
f <- function(x){
x <- x[!is.na(x)]
x <- x[1]
return(x)
}
while aggregating by the key (i.e., the "Name" column), so as to obtain as a result
Name Height Weight Age
Alice 180 70 35
Bob NA 80 27
Charles 170 75 NA
I tried
dt_agg <- aggregate(. ~ Name,
data = dt,
FUN = f)
and I got some errors, then I tried the following
dt_agg_1 <- aggregate(Height ~ Name,
data = dt,
FUN = f)
dt_agg_2 <- aggregate(Weight ~ Name,
data = dt,
FUN = f)
and this time it worked.
Since I have 50 columns, this second approach is quite cumbersome for me. Is there a way to fix the first approach?
Thanks for the help!
You were very close with the aggregate function; you just needed to adjust how aggregate handles NA (from na.omit to na.pass). My guess is that aggregate first removes all rows containing NA and then does its aggregating, instead of removing NAs as it iterates over the columns to be aggregated. Since your example dataframe has an NA in every row, you end up with a 0-row dataframe (which is the error I was getting when running your code). I tested this by removing all but one NA, and your code works as-is. So we set na.action = na.pass to pass the NAs through.
dt_agg <- aggregate(. ~ Name,
data = dt,
FUN = f, na.action = "na.pass")
Original answer:
dt_agg <- aggregate(dt[, -1],
by = list(dt$Name),
FUN = f)
dt_agg
# Group.1 Height Weight Age
# 1 Alice 180 70 35
# 2 Bob NA 80 27
# 3 Charles 170 75 NA
You can do this with dplyr:
library(dplyr)
df %>%
group_by(Name) %>%
summarize_all(funs(sort(.)[1]))
Result:
# A tibble: 3 x 4
Name Height Weight Age
<fctr> <int> <int> <int>
1 Alice 180 70 35
2 Bob NA 80 27
3 Charles 170 75 NA
Data:
df = read.table(text = "Name Height Weight Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA", header = TRUE)
Here is an option with data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) head(sort(x), 1)), Name]
# Name Height Weight Age
#1: Alice 180 70 35
#2: Bob NA 80 27
#3: Charles 170 75 NA
Simply add na.action = na.pass to the aggregate() call:
aggdf <- aggregate(.~Name, data=df, FUN=f, na.action=na.pass)
# Name Height Weight Age
# 1 Alice 180 70 35
# 2 Bob NA 80 27
# 3 Charles 170 75 NA
If you add a guard to your function so that it returns NA when all values are missing (an explicit if is safer here than ifelse(), which is vectorised and only ever returns the first element):
f <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA else x[1]
}
You can use dplyr to aggregate:
library(dplyr)
dt %>% group_by(Name) %>% summarise_all(funs(f))
This returns:
# A tibble: 3 x 4
Name Height Weight Age
<fctr> <dbl> <dbl> <dbl>
1 Alice 180 70 35
2 Bob NA 80 27
3 Charles 170 75 NA
I have a simple data.frame that looks like this:
Group Person Score_1 Score_2 Score_3
1 1 90 80 79
1 2 74 83 28
1 3 74 94 89
2 1 33 9 8
2 2 94 32 78
2 3 50 90 87
I first need to find the mean of Score_1 within each group (i.e., the Score_1 mean for Group 1, the Score_1 mean for Group 2, etc.), and then I need to collapse across both groups to find the grand mean of Score_1. How can I calculate these values and store them as individual objects? I have used the "summarise" function in dplyr, with the following code:
summarise(group_by(data, Group), mean(Score_1, na.rm = TRUE))
I would like to ultimately create a 6th column that gives the mean, repeated across persons for each group, and then a 7th column that gives the grand mean across all groups.
I'm sure there are other ways to do this, and I am open to suggestions (although I would still like to know how to do it in dplyr). Thanks!
data.table is good for tasks like this:
library(data.table)
dt <- read.table(text = "Group Person Score_1 Score_2 Score_3
1 1 90 80 79
1 2 74 83 28
1 3 74 94 89
2 1 33 9 8
2 2 94 32 78
2 3 50 90 87", header = T)
dt <- data.table(dt)
# Mean by group
dt[, score.1.mean.by.group := mean(Score_1), by = .(Group)]
# Grand mean
dt[, score.1.mean := mean(Score_1)]
dt
To create a column, we use mutate rather than summarise. We get the grand mean ('MeanScore1'), then, grouped by 'Group', get the mean by group ('MeanScorebyGroup'), and finally order the columns with select:
library(dplyr)
df1 %>%
mutate(MeanScore1 = mean(Score_1)) %>%
group_by(Group) %>%
mutate(MeanScorebyGroup = mean(Score_1)) %>%
select(1:5, 7, 6)
But this can also be done in base R in a simple way:
df1$MeanScorebyGroup <- with(df1, ave(Score_1, Group))
df1$MeanScore1 <- mean(df1$Score_1)
#akrun you just blew my mind!
Just to clarify what you said, here's my interpretation:
library(plyr)
Group <- c(1,1,1,2,2,2)
Person <- c(1,2,3,1,2,3)
Score_1 <- c(90,74,74,33,94,50)
Score_2 <- c(80,83,94,9,32,90)
Score_3 <- c(79,28,89,8,78,87)
df <- data.frame(Group, Person, Score_1, Score_2, Score_3)
df2 <- ddply(df, .(Group), mutate, meanScore = mean(Score_1, na.rm=T))
mutate(df2, meanScoreAll=mean(meanScore))
I have 3 data sets. Each of them has a column called ID. I would like to list out each ID across all 3 tables (I'm not sure I'm explaining this right). For example:
df1
ID age
1 34
2 33
5 34
7 35
43 32
76 33
df2
ID height
1 178
2 176
5 166
7 159
43 180
76 178
df3
ID class type
1 a 1
2 b 1
5 a 2
7 b 3
43 b 2
76 a 3
I would like to have an output which looks like this
ID = 1
df1 age
34
df2 height
178
df3 class type
a 1
ID = 2
df1 age
33
df2 height
176
df3 class type
b 1
I wrote a script:
listing <- function(x) {
  for(i in 1:n) {
    data <- print(x[x$ID == 'i', ])
    print(data)
  }
  return(data)
}
why am I not getting the output I wanted?
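Side note on the posted loop: x$ID == 'i' compares the IDs against the literal string "i" (so nothing ever matches), and n is never defined. A minimal repair might look like this (a sketch, printing each ID's rows in turn):

```r
listing <- function(x) {
  # iterate over the actual ID values, not the quoted letter 'i'
  for (i in unique(x$ID)) {
    print(x[x$ID == i, ])
  }
}
```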
This is a hack. If you want/need to export to a word document, I strongly urge you to use something like R-Markdown (such as RStudio) using knitr (and, behind the scenes, pandoc). I'd encourage you to look at knitr::kable, for instance, as well as better looping structures for dealing with large numbers of datasets.
This hack can be improved considerably. But it gets you the output you want.
func <- function(...) {
dfnames <- as.character(match.call()[-1])
dfs <- setNames(list(...), dfnames)
IDs <- unique(unlist(lapply(dfs, `[[`, "ID")))
fmt <- paste("%", max(nchar(dfnames)), "s %s", sep = "")
for (id in IDs) {
cat(sprintf("ID = %d\n", id))
for (nm in dfnames) {
df <- dfs[[nm]][dfs[[nm]]$ID == id, names(dfs[[nm]]) != "ID", drop = FALSE]
cat(paste(sprintf(fmt, c(nm, ""),
capture.output(print(df, row.names = FALSE))),
collapse = "\n"), "\n")
}
}
}
Execution. Though this is showing just two data.frames, you can provide an arbitrary number of data.frames (and in your preferred order) in the function arguments. It assumes you are providing them as direct variables and not subsetting within the function call ... you'll understand if you try it.
func(df1, df3)
# ID = 1
# df1 age
# 34
# df3 class type
# a 1
# ID = 2
# df1 age
# 33
# df3 class type
# b 1
# ID = 5
# df1 age
# 34
# df3 class type
# a 2
# ID = 7
# df1 age
# 35
# df3 class type
# b 3
# ID = 43
# df1 age
# 32
# df3 class type
# b 2
# ID = 76
# df1 age
# 33
# df3 class type
# a 3
(Personally, I can't imagine providing output in this format, but I don't know your tastes or use-case. There are many many other ways to show data like this. Like:
Reduce(function(x,y) merge(x, y, by = "ID"), list(df1, df2, df3))
# ID age height class type
# 1 1 34 178 a 1
# 2 2 33 176 b 1
# 3 5 34 166 a 2
# 4 7 35 159 b 3
# 5 43 32 180 b 2
# 6 76 33 178 a 3
It's much more concise. But, then again, I'm also assuming that you want to show them all at once instead of "show one, talk about it, then show another one, talk about it ...".)
Why not do a merge by ID?
df_1 <- merge(df1, df2, by = 'ID')
df_final <- merge(df_1, df3, by = 'ID')
Or, using dplyr:
library(dplyr)
full_join(df1, df2)
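With three (or more) tables, the join can be folded over a list; a sketch with minimal stand-in data:

```r
library(dplyr)

df1 <- data.frame(ID = c(1, 2), age    = c(34, 33))
df2 <- data.frame(ID = c(1, 2), height = c(178, 176))
df3 <- data.frame(ID = c(1, 2), class  = c("a", "b"))

# Fold full_join over the list, joining on the shared ID column each time
merged <- Reduce(function(x, y) full_join(x, y, by = "ID"), list(df1, df2, df3))
```

The result has one row per ID with the age, height, and class columns side by side.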
I would like to find the most frequent combination of values in a data.frame.
Here's some example data:
dat <- data.frame(age=c(50,55,60,50,55),sex=c(1,1,1,0,1),bmi=c(20,25,30,20,25))
In this example the result I am looking for is the combination of age=55, sex=1 and bmi=25, since that is the most frequent combination of column values.
My real data has about 30000 rows and 20 columns. What would be an efficient way to find the most common combination of these 20 values among the 30000 observations?
Many thanks!
Here's an approach with data.table:
dt <- data.table(dat)
setkeyv(dt, names(dt))
dt[, .N, by = key(dt)]
dt[, .N, by = key(dt)][N == max(N)]
# age sex bmi N
# 1: 55 1 25 2
And an approach with base R:
x <- data.frame(table(dat))
x[x$Freq == max(x$Freq), ]
# age sex bmi Freq
# 11 55 1 25 2
I don't know how well either of these scale though, particularly if the number of combinations is going to be large. So, test back and report!
Replace x$Freq == max(x$Freq) with which.max(x$Freq) and N == max(N) with which.max(N) if you are really just interested in one row of results.
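Spelled out, the single-row base R variant looks like this (same toy data as the question):

```r
dat <- data.frame(age = c(50, 55, 60, 50, 55),
                  sex = c(1, 1, 1, 0, 1),
                  bmi = c(20, 25, 30, 20, 25))

# table() cross-tabulates every combination; which.max keeps exactly one row
x <- data.frame(table(dat))
x[which.max(x$Freq), ]
```

Note that data.frame(table(dat)) also lists combinations with zero counts, and returns the columns as factors.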
The quick and dirty solution. I am sure there is a fancier way to do it, though, with the plyr package or similar.
> (tab <- table(apply(dat, 1, paste, collapse=", ")))
50, 0, 20 50, 1, 20 55, 1, 25 60, 1, 30
1 1 2 1
> names(which.max(tab))
[1] "55, 1, 25"
Something like this??
> dat[duplicated(dat), ]
age sex bmi
5 55 1 25
Using while (maybe time-consuming):
Here's another data.frame with more than one duplicated case:
> dat <- data.frame(age=c(50,55,60,50,55, 55, 60),
sex=c(1,1,1,0,1, 1,1),
bmi=c(20,25,30,20,25, 25,30))
> dat[duplicated(dat), ] # see data.frame
age sex bmi
5 55 1 25
6 55 1 25
7 60 1 30
# finding the most repeated item
> while(any(duplicated(dat))){
dat <- dat[duplicated(dat), ]
#print(dat)
}
> print(dat)
age sex bmi
6 55 1 25
Here's a tidyverse solution. Grouping by all variables and getting the count per group has the benefit that you can see the counts of all other groups, not just the max.
library(tidyverse)
dat <- data.frame(age=c(50,55,60,50,55),sex=c(1,1,1,0,1),bmi=c(20,25,30,20,25))
dat %>%
group_by_all() %>%
summarise(count = n()) %>%
arrange(desc(count))
#> # A tibble: 4 x 4
#> # Groups: age, sex [4]
#> age sex bmi count
#> <dbl> <dbl> <dbl> <int>
#> 1 55 1 25 2
#> 2 50 0 20 1
#> 3 50 1 20 1
#> 4 60 1 30 1
Created on 2018-10-17 by the reprex package (v0.2.0).
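For what it's worth, dplyr's count() collapses the group_by_all()/summarise()/arrange() chain into one call (a sketch on the same data):

```r
library(dplyr)

dat <- data.frame(age = c(50, 55, 60, 50, 55),
                  sex = c(1, 1, 1, 0, 1),
                  bmi = c(20, 25, 30, 20, 25))

# Tally every combination of the three columns, most frequent first
counts <- count(dat, age, sex, bmi, sort = TRUE)
```

The first row of counts is then the most frequent combination, with its frequency in the n column.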