Given the following dataset, I want to compute for each row the median of the columns M1, M2 and M3. I am looking for a solution where the final column is added to the dataframe under the name 'Median'. The column names (M1:M3) should not be used directly (in the original dataset, there are many more columns, not just 3).
# A tibble: 8 x 5
     I1    M1    M2    I2    M3
  <int> <int> <int> <int> <int>
1     3     4     5     3     5
2     2     2     2     2     1
3     2     2     2     2     2
4     3     1     3     3     1
5     2     1     3     3     1
6     3     2     4     4     3
7     3     1     3     4     1
8     2     1     3     2     3
You can load the dataset using:
df = structure(list(I1 = c(3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L), M1 = c(4L,
2L, 2L, 1L, 1L, 2L, 1L, 1L), M2 = c(5L, 2L, 2L, 3L, 3L, 4L, 3L,
3L), I2 = c(3L, 2L, 2L, 3L, 3L, 4L, 4L, 2L), M3 = c(5L, 1L, 2L,
1L, 1L, 3L, 1L, 3L)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .Names = c("I1", "M1", "M2", "I2",
"M3"))
I know that several similar questions have already been asked. However, most solutions posted use rowMeans or rowSums. I'm looking for a solution where:
(1) no 'row-function' can be used;
(2) the solution is a simple dplyr solution.
The reason for (2) is that I am teaching the 'tidyverse' to total beginners.
We could use rowMedians from matrixStats:
library(matrixStats)
library(dplyr)
df %>%
mutate(Median = rowMedians(as.matrix(.[grep('M\\d+', names(.))])))
Or, if we need to use only tidyverse functions, convert it to 'long' format with gather, summarise by row, and get the median of the 'value' column:
library(tidyr)
library(tibble)
df %>%
rownames_to_column('rn') %>%
gather(key, value, starts_with('M')) %>%
group_by(rn) %>%
summarise(Median = median(value)) %>%
ungroup %>%
select(-rn) %>%
bind_cols(df, .)
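Another tidyverse-style sketch, assuming purrr counts as acceptable here, uses pmap_dbl() to walk the selected columns row by row:
library(dplyr)
library(purrr)

# pmap_dbl() receives each row of the selected M-columns as arguments,
# so median(c(...)) computes the per-row median without rowMeans-style helpers
df %>%
  mutate(Median = pmap_dbl(select(., matches('^M\\d+$')), ~ median(c(...))))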
Or another option is rowwise() from dplyr (hopefully the 'row' in its name is not a problem):
df %>%
rowwise() %>%
mutate(Median = median(c(!!! rlang::syms(grep('M', names(.), value=TRUE)))))
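With newer dplyr (assuming version 1.0 or later), the same rowwise() idea can be written without rlang; a minimal sketch using c_across():
library(dplyr)

df %>%
  rowwise() %>%
  # c_across() gathers the M-columns of the current row into one vector
  mutate(Median = median(c_across(starts_with('M')))) %>%
  ungroup()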
Here is another approach. Given a data frame df with some numeric values:
df <- structure(list(X0 = c(0.82046171427112, 0.836224720981912, 0.842547521493854,
0.848014287631906, 0.850943494153631, 0.85425398956647, 0.85616876970771,
0.856855792247478, 0.857471048654811, 0.857507363153284, 0.874487063791594,
1.70684558846347, 1.95711031206168, 6.84386713155156), X1 = c(0.755674148966666,
0.765242580861224, 0.774422478168495, 0.776953642833977, 0.778128315184819,
0.778611604461183, 0.778624581647491, 0.778454002430202, 1.52708579075974,
13.0356519295685, 18.0590093408357, 21.1371199340156, 32.4192814934364,
33.2355314147089), X2 = c(0.772236670327724, 0.788112332251601,
0.797695511542613, 0.804257521548174, 0.809815828400878, 0.816592605516508,
0.819421106011397, 0.821734473885381, 0.822561946509595, 0.822334970491528,
0.822404634095793, 2.66875340820162, 1.40412743557514, 6.33377768022403
), X3 = c(0.764363881671609, 0.788288196346034, 0.79927498357549,
0.805446784334039, 0.810604881970155, 0.814634331592811, 0.817002594424753,
0.818129844752095, 0.818572101954132, 0.818630700031836, 3.06323952591121,
6.4477868357554, 11.4657041958038, 9.27821049066848)), class = "data.frame", row.names = c(NA,
-14L))
One can easily compute a row-wise median using base R's sapply (with the pipe for readability) like so:
library(dplyr)

df$median <- sapply(
  seq(nrow(df)),
  function(i) df[i, 1:4] %>% unlist() %>% median()
)
Above I select the columns manually with a numeric range, but to satisfy the dplyr requirement you can use dplyr::select() to choose your columns:
df$median <- sapply(
  df %>% nrow %>% seq,
  function(i) df[i, ] %>%
    dplyr::select(X1, X2) %>%
    unlist %>% median
)
I like this method because you don't have to search for a different dedicated function for each statistic you want to calculate.
For example, standard deviation:
df$sd <- sapply(
  df %>% nrow %>% seq,
  function(i) df[i, ] %>%
    dplyr::select(X1, X2) %>%
    unlist %>% sd
)
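If the same pattern is needed for several statistics, it could be wrapped in a small helper; row_stat below is a hypothetical name of my own, and the dots are forwarded straight to dplyr::select():
library(dplyr)

# Hypothetical helper: apply an arbitrary summary function .f row by row
# to a dplyr-style column selection passed through ...
row_stat <- function(data, ..., .f) {
  sapply(
    seq(nrow(data)),
    function(i) data[i, ] %>% dplyr::select(...) %>% unlist() %>% .f()
  )
}

df$median <- row_stat(df, X1, X2, .f = median)
df$sd     <- row_stat(df, X1, X2, .f = sd)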
Suppose you have the following dataframe named data:
Country V1 V2
US 1 2
US 2 1
US 3 1
UK 1 1
UK 2 1
UK 3 3
...
IT 2 2
Now I want to scale the variables V1 and V2. The first idea would be to use something like:
data %>%
mutate_at(.vars = c("V1", "V2"), .funs = scale)
But, what if I want to perform scaling separately for each value of the Country variable and have the result all in one dataframe?
This is just an example; the actual data, which I am not able to provide, contains a lot of NAs. I am worried that if I use select or some of the other functions, the data won't be joined back properly because of the NAs.
If we want each scaled variable as a separate data.frame/tibble, one option is purrr::map, storing the results in a list:
library(dplyr)
library(purrr)

map(c("V1", "V2"), function(col) data %>%
      select(Country, all_of(col)) %>%
      group_by(Country) %>%
      mutate(across(everything(), ~ c(scale(.x)))) %>%
      ungroup())
Or, if we want the result all in one data frame, we can do a group_by and scale within each group:
data %>%
group_by(Country) %>%
mutate_at(vars(V1, V2), ~ c(scale(.)))
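A sketch of the same thing with across(), assuming dplyr >= 1.0; the c() around scale() is there because scale() returns a one-column matrix and c() drops it back to a plain vector:
library(dplyr)

data %>%
  group_by(Country) %>%
  # scale each variable within its Country group
  mutate(across(c(V1, V2), ~ c(scale(.x)))) %>%
  ungroup()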
Here is a solution with base R (given the data frame df as defined under DATA below):
r <- Reduce(rbind, lapply(split(df, df$Country), function(x) {
  x[-1] <- scale(x[-1])
  x
}))
res <- r[order(as.numeric(rownames(r))), ]
such that
> res
Country V1 V2
1 US -1 1.1547005
2 US 0 -0.5773503
3 US 1 -0.5773503
4 UK -1 -0.5773503
5 UK 0 -0.5773503
6 UK 1 1.1547005
7 IT NaN NaN
DATA
df <- structure(list(Country = structure(c(3L, 3L, 3L, 2L, 2L, 2L,
1L), .Label = c("IT", "UK", "US"), class = "factor"), V1 = c(1L,
2L, 3L, 1L, 2L, 3L, 2L), V2 = c(2L, 1L, 1L, 1L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA,
-7L))
In a data table, all the cells are numeric, and what I want to do is replace every number with a string, like this:
Numbers in [0,2]: replace them with the string "Bad"
Numbers in [3,4]: replace them with the string "Good"
Numbers > 4 : replace them with the string "Excellent"
Here's an example of my original table called "data.active":
My attempt to do that is this:
x <- c("churches","resorts","beaches","parks","Theatres",.....)
for(i in x){
data.active$i <- as.character(data.active$i)
data.active$i[data.active$i <= 2] <- "Bad"
data.active$i[data.active$i >2 && data.active$i <=4] <- "Good"
data.active$i[data.active$i >4] <- "Excellent"
}
But it doesn't work. Is there any other way to do this?
EDIT
Here's the link to my dataset GoogleReviews_Dataset and here's how I got the table in the image above:
library(FactoMineR)
library(factoextra)
data<-read.csv2(file.choose())
data.active <- data[1:10, 4:8]
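As a side note on why the loop fails: data.active$i looks for a column literally named "i" rather than the column whose name is stored in i, && only compares the first elements (& is the vectorised form), and converting to character before the numeric comparisons makes them unreliable. A minimal sketch of a corrected base R loop, assuming x holds the rating column names and those columns are numeric:
for (i in x) {
  v <- data.active[[i]]                      # [[i]] uses the value stored in i
  data.active[[i]] <- ifelse(v <= 2, "Bad",
                             ifelse(v <= 4, "Good", "Excellent"))
}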
You can use the tidyverse's mutate-across combination to condition on the ranges:
library(tidyverse)
df <- tibble(
  x = 1:5,
  y = c(1L, 2L, 2L, 2L, 3L),
  z = c(1L, 3L, 3L, 3L, 2L),
  a = c(1L, 5L, 6L, 4L, 8L),
  b = c(1L, 3L, 4L, 7L, 1L)
)
df %>% mutate(
  across(
    .cols = everything(),
    .fns = ~ case_when(
      .x <= 2 ~ 'Bad',
      .x > 2 & .x <= 4 ~ 'Good',
      .x > 4 ~ 'Excellent',
      TRUE ~ as.character(.x)
    )
  )
)
The .x above represents the column being evaluated (purrr-style lambda syntax). This results in:
# A tibble: 5 x 5
  x         y     z     a         b
  <chr>     <chr> <chr> <chr>     <chr>
1 Bad       Bad   Bad   Bad       Bad
2 Bad       Bad   Good  Excellent Good
3 Good      Bad   Good  Excellent Good
4 Good      Bad   Good  Good      Excellent
5 Excellent Good  Bad   Excellent Bad
For changing only certain columns, pass a selection to the .cols argument of across():
df %>% mutate(
  across(
    .cols = c('a', 'x', 'b'),
    .fns = ~ case_when(
      .x <= 2 ~ 'Bad',
      .x > 2 & .x <= 4 ~ 'Good',
      .x > 4 ~ 'Excellent',
      TRUE ~ as.character(.x)
    )
  )
)
This yields:
# A tibble: 5 x 5
  x             y     z a         b
  <chr>     <int> <int> <chr>     <chr>
1 Bad           1     1 Bad       Bad
2 Bad           2     3 Excellent Good
3 Good          2     3 Excellent Good
4 Good          2     3 Good      Excellent
5 Excellent     3     2 Excellent Bad
Or with cut from base R, which bins the numbers and labels the bins in one step:
x <- c('x', 'y', 'z')
df[, x] <- lapply(df[, x], function(col)
  cut(col, breaks = c(-Inf, 2, 4, Inf), labels = c('Bad', 'Good', 'Excellent')))
Data
df<-structure(list(x = 1:5, y = c(1L, 2L, 2L, 2L, 3L), z = c(1L,3L, 3L, 3L, 2L),
a = c(1L, 5L, 6L, 4L, 8L),b = c(1L, 3L, 4L, 7L, 1L)),
class = "data.frame", row.names = c(NA, -5L))
I need to find a running maximum of a variable by group using R. The variable is sorted by time within group using df[order(df$group, df$time),].
My variable has some NAs, but I can deal with them by replacing them with zeros for this computation.
This is how the data frame df looks:
(df <- structure(list(var = c(5L, 2L, 3L, 4L, 0L, 3L, 6L, 4L, 8L, 4L),
group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
.Label = c("a", "b"), class = "factor"),
time = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L)),
.Names = c("var", "group","time"),
class = "data.frame", row.names = c(NA, -10L)))
# var group time
# 1 5 a 1
# 2 2 a 2
# 3 3 a 3
# 4 4 a 4
# 5 0 a 5
# 6 3 b 1
# 7 6 b 2
# 8 4 b 3
# 9 8 b 4
# 10 4 b 5
And I want a variable curMax as:
var group time curMax
  5     a    1      5
  2     a    2      5
  3     a    3      5
  4     a    4      5
  0     a    5      5
  3     b    1      3
  6     b    2      6
  4     b    3      6
  8     b    4      8
  4     b    5      8
Please let me know if you have any idea how to implement it in R.
We can try data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1) (df1 being the df from the question); then, grouped by 'group', we get the cummax of 'var' and assign it (:=) to a new variable ('curMax'):
library(data.table)
setDT(df1)[, curMax := cummax(var), by = group]
As commented by @Michael Chirico, if the data is not ordered by 'time', we can do the ordering in 'i':
setDT(df1)[order(time), curMax:=cummax(var), by = group]
Or with dplyr
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(curMax = cummax(var))
If df1 is a tbl_sql, explicit ordering might be required, using arrange:
df1 %>%
group_by(group) %>%
arrange(time, .by_group=TRUE) %>%
mutate(curMax = cummax(var))
or dbplyr::window_order
library(dbplyr)
df1 %>%
group_by(group) %>%
window_order(time) %>%
mutate(curMax = cummax(var))
Or you can do it with ave from base R:
df$curMax <- ave(df$var, df$group, FUN=cummax)
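Since the real data contains NAs and cummax() propagates an NA once it meets one, a hedged alternative to replacing NAs with zeros is to substitute -Inf, so a missing value can never become the running maximum; a sketch with tidyr::replace_na:
library(dplyr)
library(tidyr)

df %>%
  group_by(group) %>%
  arrange(time, .by_group = TRUE) %>%
  # -Inf can never win the running maximum, unlike a 0 replacement
  mutate(curMax = cummax(replace_na(as.numeric(var), -Inf))) %>%
  ungroup()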
I have a dataframe which looks like this:
Id Result
A 1
B 2
C 1
B 1
C 1
A 2
B 1
B 2
C 1
A 1
B 2
Now I need to calculate how many 1s and 2s there are for each Id, and then select the number whose frequency of occurrence is the greatest.
Id Result
A 1
B 2
C 1
How can I do that? I have tried using the table function in some way, but was not able to use it effectively. Any help would be appreciated.
Here you can use aggregate in one step:
df <- structure(list(Id = structure(c(1L, 2L, 3L, 2L, 3L, 1L, 2L, 2L,
3L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor"),
Result = c(1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)),
.Names = c("Id", "Result"), class = "data.frame", row.names = c(NA, -11L)
)
res <- aggregate(Result ~ Id, df, FUN=function(x){which.max(c(sum(x==1), sum(x==2)))})
res
Result:
Id Result
1 A 1
2 B 2
3 C 1
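Note that which.max(c(sum(x==1), sum(x==2))) works here only because the values happen to be 1 and 2, so the index returned by which.max() coincides with the value; a sketch that generalises to any set of values, using a small helper (stat_mode is a name I made up):
# most frequent value in x; ties go to whichever value table() lists first
stat_mode <- function(x) as.integer(names(which.max(table(x))))

res <- aggregate(Result ~ Id, df, FUN = stat_mode)
res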
With data.table you can try (df is your data.frame):
require(data.table)
dt<-as.data.table(df)
dt[,list(times=.N),by=list(Id,Result)][,list(Result=Result[which.max(times)]),by=Id]
# Id Result
#1: A 1
#2: B 2
#3: C 1
Using dplyr, you can try:
df %>%
  group_by(Id, Result) %>%
  summarize(n = n()) %>%
  group_by(Id) %>%
  filter(n == max(n)) %>%
  summarize(Result = Result)
Id Result
1 A 1
2 B 2
3 C 1
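A sketch of the same idea with count() and slice_max(), assuming dplyr >= 1.0; with_ties = FALSE keeps a single row per Id even if two Results tie on frequency:
library(dplyr)

df %>%
  count(Id, Result, name = "freq") %>%   # frequency of each Id/Result pair
  group_by(Id) %>%
  slice_max(freq, n = 1, with_ties = FALSE) %>%
  select(-freq)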
An option using table and ave (with df as defined above):
subset(as.data.frame(table(df)), ave(Freq, Id, FUN=max)==Freq, select=-3)
# Id Result
# 1 A 1
# 3 C 1
# 5 B 2