Selectively scale a variable in R

Suppose you have the following dataframe named data:
Country V1 V2
US 1 2
US 2 1
US 3 1
UK 1 1
UK 2 1
UK 3 3
...
IT 2 2
Now I want to scale the variables V1 and V2. The first idea would be to use something like:
data %>%
  mutate_at(.vars = c("V1", "V2"), .funs = scale)
But, what if I want to perform scaling separately for each value of the Country variable and have the result all in one dataframe?
This is just an example; the actual data, which I am not able to share, contains a lot of NAs. I am worried that if I use select or similar functions, the data won't be joined back together properly because of the NAs.

If we want the result as a separate data.frame/tibble per variable, one option is map, which stores the output in a list:
library(dplyr)
library(purrr)
map(c("V1", "V2"), ~ data %>%
  select(Country, all_of(.x)) %>%
  group_by(Country) %>%
  mutate_at(vars(all_of(.x)), ~ c(scale(.))))
Or, if we want the result in a single data.frame, do a group_by and scale within each group (the c() wrapper drops the matrix attributes that scale() returns):
data %>%
  group_by(Country) %>%
  mutate_at(vars(V1, V2), ~ c(scale(.)))
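Regarding the NA worry in the question: base scale() ignores NAs when computing the centre and scale, and leaves them as NA in the output; since mutate transforms columns in place, there is no joining step that could misalign rows. A quick sketch with toy data (not the asker's) to check this:
# NA stays NA; the other values are scaled using only the non-missing entries
d <- data.frame(Country = c("US", "US", "US"), V1 = c(1, NA, 3))
d %>%
  group_by(Country) %>%
  mutate_at(vars(V1), ~ c(scale(.)))
# V1 becomes -0.707, NA, 0.707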

Here is a solution with base R (given the data frame df as in the post). Split by Country, scale the value columns within each piece, bind the pieces back together, and restore the original row order:
scaled <- lapply(split(df, df$Country), function(x) {
  x[-1] <- scale(x[-1])
  x
})
r <- Reduce(rbind, scaled)
res <- r[order(as.numeric(rownames(r))), ]
such that
> res
Country V1 V2
1 US -1 1.1547005
2 US 0 -0.5773503
3 US 1 -0.5773503
4 UK -1 -0.5773503
5 UK 0 -0.5773503
6 UK 1 1.1547005
7 IT NaN NaN
DATA
df <- structure(list(Country = structure(c(3L, 3L, 3L, 2L, 2L, 2L,
1L), .Label = c("IT", "UK", "US"), class = "factor"), V1 = c(1L,
2L, 3L, 1L, 2L, 3L, 2L), V2 = c(2L, 1L, 1L, 1L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA,
-7L))

Related

Create a list that contains a numeric value for each response to a categorical variable

I have a factor with 2 levels: Feed, Food
structure(c(2L, 2L, 1L, 2L, 1L, 2L), .Label = c("Feed", "Food"), class = "factor")
I want to create a numeric value to match each level (i.e. Feed = 0, Food = 1), so that the factor and the numeric values sit side by side as two columns.
Probably very simple... every time I've attempted to put the two together, both columns have ended up numeric.
Like this?
library(dplyr)
df <- structure(c(2L, 2L, 1L, 2L, 1L, 2L), .Label = c("Feed", "Food"), class = "factor")
df <- tibble("foods" = df)
df %>%
  mutate(numeric = as.numeric(foods))
# A tibble: 6 x 2
foods numeric
<fct> <dbl>
1 Food 2
2 Food 2
3 Feed 1
4 Food 2
5 Feed 1
6 Food 2
Or like this if you want 0/1 as numbers.
df %>%
  mutate(numeric = as.numeric(foods) - 1)
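If you'd rather not depend on the order of the factor levels, here is a small sketch using an explicit lookup vector (the name lookup is just illustrative):
lookup <- c(Feed = 0, Food = 1)  # explicit level-to-number mapping
df %>%
  mutate(numeric = unname(lookup[as.character(foods)]))
This keeps working even if the factor's level order changes.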

Rowwise median for multiple columns using dplyr

Given the following dataset, I want to compute for each row the median of the columns M1,M2 and M3. I am looking for a solution where the final column is added to the dataframe under the name 'Median'. The column names (M1:M3) should not be used directly (in the original dataset, there are many more columns, not just 3).
# A tibble: 8 x 5
I1 M1 M2 I2 M3
<int> <int> <int> <int> <int>
1 3 4 5 3 5
2 2 2 2 2 1
3 2 2 2 2 2
4 3 1 3 3 1
5 2 1 3 3 1
6 3 2 4 4 3
7 3 1 3 4 1
8 2 1 3 2 3
You can load the dataset using:
df = structure(list(I1 = c(3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L), M1 = c(4L,
2L, 2L, 1L, 1L, 2L, 1L, 1L), M2 = c(5L, 2L, 2L, 3L, 3L, 4L, 3L,
3L), I2 = c(3L, 2L, 2L, 3L, 3L, 4L, 4L, 2L), M3 = c(5L, 1L, 2L,
1L, 1L, 3L, 1L, 3L)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .Names = c("I1", "M1", "M2", "I2",
"M3"))
I know that several similar questions have already been asked. However, most solutions posted use rowMeans or rowSums. I'm looking for a solution where:
1. no 'row-function' is used;
2. the solution is a simple dplyr solution.
The reason for (2) is that I am teaching the 'tidyverse' to total beginners.
We could use rowMedians
library(matrixStats)
library(dplyr)
df %>%
  mutate(Median = rowMedians(as.matrix(.[grep('M\\d+', names(.))])))
Or, if we need to use only tidyverse functions, convert to 'long' format with gather, group by row, and take the median of the 'value' column:
library(tidyr)
library(tibble)
df %>%
  rownames_to_column('rn') %>%
  gather(key, value, starts_with('M')) %>%
  group_by(rn) %>%
  summarise(Median = median(value)) %>%
  ungroup() %>%
  select(-rn) %>%
  bind_cols(df, .)
Or another option is rowwise() from dplyr (hoping rowwise doesn't count as a 'row-function'):
df %>%
  rowwise() %>%
  mutate(Median = median(c(!!! rlang::syms(grep('M', names(.), value = TRUE)))))
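On dplyr 1.0 or later, a tidier rowwise sketch uses c_across, assuming the measure columns are exactly those whose names start with 'M':
df %>%
  rowwise() %>%
  mutate(Median = median(c_across(starts_with('M')))) %>%  # median across M1, M2, M3
  ungroup()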
Given a dataframe df with some numeric values:
df <- structure(list(X0 = c(0.82046171427112, 0.836224720981912, 0.842547521493854,
0.848014287631906, 0.850943494153631, 0.85425398956647, 0.85616876970771,
0.856855792247478, 0.857471048654811, 0.857507363153284, 0.874487063791594,
1.70684558846347, 1.95711031206168, 6.84386713155156), X1 = c(0.755674148966666,
0.765242580861224, 0.774422478168495, 0.776953642833977, 0.778128315184819,
0.778611604461183, 0.778624581647491, 0.778454002430202, 1.52708579075974,
13.0356519295685, 18.0590093408357, 21.1371199340156, 32.4192814934364,
33.2355314147089), X2 = c(0.772236670327724, 0.788112332251601,
0.797695511542613, 0.804257521548174, 0.809815828400878, 0.816592605516508,
0.819421106011397, 0.821734473885381, 0.822561946509595, 0.822334970491528,
0.822404634095793, 2.66875340820162, 1.40412743557514, 6.33377768022403
), X3 = c(0.764363881671609, 0.788288196346034, 0.79927498357549,
0.805446784334039, 0.810604881970155, 0.814634331592811, 0.817002594424753,
0.818129844752095, 0.818572101954132, 0.818630700031836, 3.06323952591121,
6.4477868357554, 11.4657041958038, 9.27821049066848)), class = "data.frame", row.names = c(NA,
-14L))
One can easily compute row-wise median using base R like so:
df$median <- sapply(
  seq(nrow(df)),
  function(i) df[i, 1:4] %>% unlist %>% median
)
Above I select columns manually with a numeric range; to satisfy the dplyr requirement, you can use dplyr::select() to choose your columns (here X1 and X2, as an example):
df$median <- sapply(
  df %>% nrow %>% seq,
  function(i) df[i, ] %>%
    dplyr::select(X1, X2) %>%
    unlist %>% median
)
I like this approach because you don't have to hunt for a dedicated row-wise function for each statistic. For example, standard deviation:
df$sd <- sapply(
  df %>% nrow %>% seq,
  function(i) df[i, ] %>%
    dplyr::select(X1, X2) %>%
    unlist %>% sd
)
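For comparison, a base-R sketch with apply computes the same row-wise statistics as the first example (all four X columns) without looping over row indices by hand; the grep pattern assumes all value columns start with 'X':
x_cols <- grep('^X', names(df), value = TRUE)  # X0, X1, X2, X3
df$median <- apply(df[x_cols], 1, median)      # row-wise median
df$sd     <- apply(df[x_cols], 1, sd)          # row-wise standard deviation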

R - dplyr map slice for repeat rows

I have trouble combining slice and map.
I am interested in doing something similar to this, which is, in my case, transforming a compact person-period file into a long (sequential) person-period one. However, because my file is too big, I need to split the data first.
My data look like this
group id var ep dur
1 A 1 a 1 20
2 A 1 b 2 10
3 A 1 a 3 5
4 A 2 b 1 5
5 A 2 b 2 10
6 A 2 b 3 15
7 B 1 a 1 20
8 B 1 a 2 10
9 B 1 a 3 10
10 B 2 c 1 20
11 B 2 c 2 5
12 B 2 c 3 10
What I need is simply this (the answer from that question):
library(dplyr)
dt %>% slice(rep(1:n(), .$dur))
However, I am interested in introducing a split(.$group). How am I supposed to do that?
dt %>% split(.$group) %>% map_df(slice(rep(1:n(), .$dur)))
does not work, for example.
My desired output is the same as dt %>% slice(rep(1:n(),.$dur))
which is
group id var ep dur
1 A 1 a 1 20
2 A 1 a 1 20
3 A 1 a 1 20
4 A 1 a 1 20
5 A 1 a 1 20
6 A 1 a 1 20
7 A 1 a 1 20
8 A 1 a 1 20
9 A 1 a 1 20
10 A 1 a 1 20
.....
But I need to split this operation because the file is too big.
data
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), ep = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor"), dur = c(20, 10, 5, 5, 10, 15, 20,
10, 10, 20, 5, 10)), .Names = c("group", "id", "var", "ep",
"dur"), row.names = c(NA, -12L), class = "data.frame")
map takes two arguments: a vector/list in .x and a function in .f. It then applies .f to every element of .x.
The function you are passing to map is not formatted correctly. Try this:
f <- function(x) x %>% slice(rep(1:n(), .$dur))
dt %>%
  split(.$group) %>%
  map_df(f)
You could also use it like this:
dt %>%
  split(.$group) %>%
  map_df(slice, rep(1:n(), dur))
This time you directly pass the slice function to map with additional parameters.
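As an aside, if the data does fit in memory, tidyr's uncount replicates rows by a weight column directly; a sketch keeping dur with .remove = FALSE:
library(tidyr)
dt %>%
  uncount(dur, .remove = FALSE)  # one row per unit of dur, dur column retained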
I'm not quite sure what your desired final output is, but you could use tidyr to nest the data that you want to repeat and a simple function to expand levels of your nested data, very similar to Tutuchan's answer (note this uses the older, pre-1.0 tidyr nest syntax):
expand_df <- function(df, repeats) {
  df %>% slice(rep(1:n(), repeats))
}
dt %>%
  tidyr::nest(var:ep) %>%
  mutate(expanded = purrr::map2(data, dur, expand_df)) %>%
  select(-data) %>%
  tidyr::unnest()
Tutuchan's answer gives exactly the same output as your original approach - is that what you were looking for? I don't know if it will have any advantage over your original method.

Finding the max number of occurrences from the available result

I have a dataframe which looks like this:
Id Result
A 1
B 2
C 1
B 1
C 1
A 2
B 1
B 2
C 1
A 1
B 2
Now I need to calculate how many 1's and 2's there are for each Id, and then select the number whose frequency of occurrence is the greatest.
Id Result
A 1
B 2
C 1
How can I do that? I have tried using the table function in some way, but was not able to use it effectively. Any help would be appreciated.
Here you can use aggregate in one step:
df <- structure(list(Id = structure(c(1L, 2L, 3L, 2L, 3L, 1L, 2L, 2L,
3L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor"),
Result = c(1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)),
.Names = c("Id", "Result"), class = "data.frame", row.names = c(NA, -11L)
)
res <- aggregate(Result ~ Id, df, FUN = function(x) which.max(c(sum(x == 1), sum(x == 2))))
res
Note that which.max returns the index 1 or 2, which here happens to coincide with the Result value itself.
Result:
Id Result
1 A 1
2 B 2
3 C 1
With data.table you can try (df is your data.frame):
require(data.table)
dt <- as.data.table(df)
dt[, list(times = .N), by = list(Id, Result)][
  , list(Result = Result[which.max(times)]), by = Id]
# Id Result
#1: A 1
#2: B 2
#3: C 1
Using dplyr, you can try
library(dplyr)
df %>%
  group_by(Id, Result) %>%
  summarize(n = n()) %>%
  group_by(Id) %>%
  filter(n == max(n)) %>%
  summarize(Result = Result)
Id Result
1 A 1
2 B 2
3 C 1
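On current dplyr, a more compact sketch uses count and slice_max (note that ties would return more than one row per Id):
df %>%
  count(Id, Result) %>%    # frequency of each Id/Result pair
  group_by(Id) %>%
  slice_max(n, n = 1) %>%  # keep the most frequent Result per Id
  ungroup() %>%
  select(-n)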
An option using table and ave (select = -3 drops the Freq column):
subset(as.data.frame(table(df)), ave(Freq, Id, FUN = max) == Freq, select = -3)
# Id Result
# 1 A 1
# 3 C 1
# 5 B 2

Removing duplicate rows with ddply

I have a dataframe df containing two factor variables (Var and Year) as well as one (in reality several) column with values.
df <- structure(list(Var = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Year = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 3L, 1L, 2L, 3L), .Label = c("2000", "2001",
"2002"), class = "factor"), Val = structure(c(1L, 2L, 2L, 4L,
1L, 3L, 3L, 5L, 6L, 6L), .Label = c("2", "3", "4", "5", "8",
"9"), class = "factor")), .Names = c("Var", "Year", "Val"), row.names = c(NA,
-10L), class = "data.frame")
> df
Var Year Val
1 A 2000 2
2 A 2001 3
3 A 2002 3
4 B 2000 5
5 B 2001 2
6 B 2002 4
7 B 2002 4
8 C 2000 8
9 C 2001 9
10 C 2002 9
Now I'd like to find rows with the same value for Val for each Var and Year and only keep one of those. So in this example I would like row 7 to be removed.
I've tried to find a solution with plyr using something like
df_new <- ddply(df, .(Var, Year), summarise, !duplicate(Val))
but obviously that is not a function accepted by ddply.
I found this similar question but the plyr solution by Arun only gives me a dataframe with 0 rows and 0 columns and I do not understand the answer well enough to modify it according to my needs.
Any hints on how to go about that?
Non-duplicates of Val by Var and Year are the same as non-duplicates of Val, Var, and Year. You can specify several columns for duplicated (or the whole data frame).
I think this does what you'd like.
df[!duplicated(df), ]
Or:
df[!duplicated(df[, c("Var", "Year", "Val")]), ]
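The dplyr counterpart of this, for what it's worth, is distinct; with .keep_all = TRUE it also carries along any extra columns:
library(dplyr)
df %>%
  distinct(Var, Year, Val, .keep_all = TRUE)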
You can just use the unique() function instead of !duplicated(Val):
df_new <- ddply(df, .(Var, Year), summarise, Val = unique(Val))
# or
df_new <- ddply(df, .(Var, Year), function(x) x[!duplicated(x$Val),])
# or if you only have these 3 columns:
df_new <- ddply(df, .(Var, Year), unique)
# with dplyr (the old %.% operator has been replaced by %>%)
df %>% group_by(Var, Year) %>% filter(!duplicated(Val))
hth
You don't need the plyr package here. If your whole dataset consists of only these 3 columns and you need to remove the duplicates, then you can use:
df_new <- unique(df)
Else, if you need to just pick up the first observation for a group by variable list, then you can use the method suggested by Richard. That's usually how I have been doing it.
