Operations on multiple columns across many tables - r

I have two tables (dt1, dt2). dt2 contains the same variable names as dt1.
For each variable in dt1, I would like to multiply it by its corresponding value from dt2.
In the example below, x from dt1 gets multiplied by 4 and y by 7.
What would be a fast way to do it?
Thank you
set.seed(123)
dt1 <- data.frame(x = sample(1:10, 10, TRUE), y = sample(1:10, 10, TRUE) )
dt1
dt2 = data.frame (names = c("x", "y"), values = c(4, 7))
dt2

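purrr style
library(purrr) # map2_df()
library(tidyr) # pivot_wider()
library(dplyr) # %>%
# columns are paired by position, so dt2 must list the names in the same order as dt1
map2_df(dt1, dt2 %>% pivot_wider(names_from = names, values_from = values), ~.y * .x)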
# A tibble: 10 x 2
x y
<dbl> <dbl>
1 12 35
2 12 21
3 40 63
4 8 63
5 24 63
6 20 21
7 16 56
8 24 70
9 36 49
10 40 70

You can try sweep
> sweep(dt1, 2, dt2$values[match(names(dt1), dt2$names)], "*")
x y
1 12 35
2 12 21
3 40 63
4 8 63
5 24 63
6 20 21
7 16 56
8 24 70
9 36 49
10 40 70
or
> dt1[] <- t(t(dt1) * dt2$values[match(names(dt1), dt2$names)])
> dt1
x y
1 12 35
2 12 21
3 40 63
4 8 63
5 24 63
6 20 21
7 16 56
8 24 70
9 36 49
10 40 70
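Both base R calls pair multipliers with columns through match(), so the lookup has to follow the column order of dt1 rather than the row order of dt2. A minimal sketch of a name-based lookup, assuming a small made-up dt1 and a dt2 whose rows are deliberately listed in a different order (neither is the question's data):

```r
# Made-up data: the lookup table lists y before x on purpose
dt1 <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
dt2 <- data.frame(names = c("y", "x"), values = c(7, 4))

# Named vector of multipliers, reordered to follow names(dt1)
mult <- setNames(dt2$values, dt2$names)[names(dt1)]

# Multiply each column by the multiplier that carries the same name
res <- as.data.frame(Map(`*`, dt1, mult))
res
```

Indexing the named vector by names(dt1) means the result does not depend on how dt2 happens to be sorted.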

Related

How to bin columns based on the minimum and maximum of a column

I've got a dataset whose scores need to be converted from a continuous scale to a categorical one. Each value should be put into one of 10 equal-width categories based on the minimum and maximum of its column. So if the minimum = 1 and the maximum = 100, there will be 10 categories such that any value from 1-10 = 1, 11-20 = 2, 21-30 = 3, ..., 91-100 = 10. Here's what my data looks like:
df <- as.data.frame(cbind(test1 = sample(13:52, 15),
test2 = sample(16:131, 15)))
> df
test1 test2
1 44 131
2 26 83
3 74 41
4 6 73
5 83 20
6 63 110
7 23 29
8 42 64
9 41 40
10 10 96
11 2 39
12 14 24
13 67 30
14 51 59
15 66 37
So far I have a function:
trail.bin <- function(data, col, min, max) {
for(i in 1:10) {
for(e in 0:9) {
x <- as.data.table(data)
mult <- (max - min)/10
x[col >= min+(e*mult) & col < min+(i*mult),
col := i]
}
}
return(x)
}
What I'm trying to do is take the minimum and maximum, find what the spacing of intervals would be (mult), then use two loops on a data.table reference syntax. The outcome I'm hoping for is:
df2
test1 test2
1 5 131
2 3 83
3 8 41
4 1 73
5 9 20
6 7 110
7 3 29
8 5 64
9 5 40
10 2 96
11 1 39
12 2 24
13 7 30
14 6 59
15 7 37
Thanks!
You could create a function using cut
library(data.table)
trail.bin <- function(data, col, n) {
data[, (col) := lapply(.SD, cut, n, labels = FALSE), .SDcols = col]
return(data)
}
setDT(df)
trail.bin(df, 'test1', 10)
You can also pass multiple columns
trail.bin(df, c('test1', 'test2'), 10)
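To see what cut does inside that function, here is a minimal base R sketch on made-up input covering 1 to 100 (not the question's data). When given a single number of breaks, cut splits the observed range into that many equal-width intervals, and labels = FALSE returns the integer bin codes instead of a factor:

```r
# Made-up input covering 1..100
x <- 1:100

# Ten equal-width bins over range(x); labels = FALSE gives integer codes 1..10
bins <- cut(x, breaks = 10, labels = FALSE)

# Which values landed in the first and in the last bin
range(x[bins == 1])
range(x[bins == 10])
```

Because the bins come from the observed minimum and maximum of each column, the separate min and max arguments of the original function are not needed.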

Group_by / summarize by two variables within a function

I would like to write a function that summarizes the provided data by some specified criteria, in this case by age.
The example data is a table of users' ages and their stats.
df <- data.frame('Age'=rep(18:25,2), 'X1'=10:17, 'X2'=28:35,'X4'=22:29)
Next I define the output columns that are relevant for the analysis
output_columns <- c('Age', 'X1', 'X2', 'X3')
This function computes the sums of X1, X2 and X3 grouped by age.
aggr <- function(data, criteria, output_columns){
k <- data %>% .[, colnames(.) %in% output_columns] %>%
group_by_(.dots = criteria) %>%
#summarise_each(funs(count), age) %>%
summarize_if(is.numeric, sum)
return (k)
}
When I call it like this
> e <- aggr(df, "Age", output_columns)
> e
# A tibble: 8 x 3
Age X1 X2
<int> <int> <int>
1 18 20 56
2 19 22 58
3 20 24 60
4 21 26 62
5 22 28 64
6 23 30 66
7 24 32 68
8 25 34 70
I want to have another column called count which shows the number of observations in each age group. Desired output is
> desired
Age X1 X2 count
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
I have tried different ways to do that, e.g. tally(), summarize_each, etc. They all deliver wrong results.
I believe there should be an easy and simple way to do that.
Any help is appreciated.
Since you're already summing all variables, you can just add a column of all 1s before the summary function
aggr <- function(data, criteria, output_columns){
data %>%
.[, colnames(.) %in% output_columns] %>%
group_by_(.dots = criteria) %>%
mutate(n = 1L) %>%
summarize_if(is.numeric, sum)
}
# A tibble: 8 x 4
Age X1 X2 n
<int> <int> <int> <int>
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
We could create the 'count' column before summarise_if
aggr<- function(data, criteria, output_columns){
data %>%
select(intersect(names(.), output_columns))%>%
group_by_at(criteria)%>%
group_by(count = n(), add= TRUE) %>%
summarize_if(is.numeric,sum) %>%
select(setdiff(names(.), 'count'), count)
}
aggr(df,"Age",output_columns)
# A tibble: 8 x 4
# Groups: Age [8]
# Age X1 X2 count
# <int> <int> <int> <int>
#1 18 20 56 2
#2 19 22 58 2
#3 20 24 60 2
#4 21 26 62 2
#5 22 28 64 2
#6 23 30 66 2
#7 24 32 68 2
#8 25 34 70 2
In base R you could do
aggr <- function(data, criteria, output_columns){
ds <- data[, colnames(data) %in% output_columns]
d <- aggregate(ds, by=list(criteria), function(x) c(sum(x), length(x)))
"names<-"(do.call(data.frame, d)[, -c(2:3, 5)], c(names(ds), "n"))
}
> with(df, aggr(df, Age, output_columns))
Age X1 X2 n
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
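The base R version above drops columns by position, which ties it to this exact column layout. A sketch of an alternative that computes the sums and the counts separately and merges them by the grouping column; aggr_count is a hypothetical name, and it assumes every kept column other than the grouping column is numeric:

```r
df <- data.frame(Age = rep(18:25, 2), X1 = 10:17, X2 = 28:35, X4 = 22:29)
output_columns <- c("Age", "X1", "X2", "X3")

# Hypothetical helper: sums per group plus a count of rows per group
aggr_count <- function(data, criteria, output_columns) {
  keep <- intersect(names(data), output_columns)   # X3 is absent, so it is skipped
  value_cols <- setdiff(keep, criteria)
  sums <- aggregate(data[value_cols], by = data[criteria], FUN = sum)
  ones <- data.frame(count = rep(1L, nrow(data)))  # one per row, summed per group
  counts <- aggregate(ones, by = data[criteria], FUN = sum)
  merge(sums, counts, by = criteria)
}

res <- aggr_count(df, "Age", output_columns)
res
```

Here the grouping column is passed as a string, so the call does not need to be wrapped in with().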

Having a subset of a data set based on a specific condition

I have a data set in which the values of "age" have different units (days, months, years). I want to convert the rows whose values are in days or months to years. How can I do it in R?
If there is no letter after the number, then the unit is years.
If there is a ‘D’ after the number, then the unit is days (e.g. 10D means 10 days)
If there is an ‘M’ after the number, then the unit is months (e.g. 5M means 5 months).
Age <- c("33","32","44","54M","67M","34D","33D","44","77","88M","49 D","55D","11M")
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13)
Data <- data.frame(ID,Age)
> Data
ID Age
1 1 33
2 2 32
3 3 44
4 4 54M
5 5 67M
6 6 34D
7 7 33D
8 8 44
9 9 77
10 10 88M
11 11 49 D
12 12 55D
13 13 11M
Here's a quick way in base R:
Data$units = ifelse(grepl("M", Data$Age), "month", ifelse(grepl("D", Data$Age), "day", "year"))
Data$value = as.numeric(gsub(pattern = "[^0-9]", replacement = "", Data$Age))
Data$result = with(Data,
ifelse(units == "year", value,
ifelse(units == "month", value / 12, value / 365.25)))
Data
# ID Age units value result
# 1 1 33 year 33 33.00000000
# 2 2 32 year 32 32.00000000
# 3 3 44 year 44 44.00000000
# 4 4 54M month 54 4.50000000
# 5 5 67M month 67 5.58333333
# 6 6 34D day 34 0.09308693
# 7 7 33D day 33 0.09034908
# 8 8 44 year 44 44.00000000
# 9 9 77 year 77 77.00000000
# 10 10 88M month 88 7.33333333
# 11 11 49 D day 49 0.13415469
# 12 12 55D day 55 0.15058179
# 13 13 11M month 11 0.91666667
And here's another option using tidyverse tools:
library(dplyr)
library(stringr)
Data %>%
mutate(Unit = str_extract(string = Age,pattern = "[DM]"),
Unit = if_else(is.na(Unit),'Y',Unit),
Age = as.numeric(gsub(pattern = "[MD]","",Age))) %>%
mutate(AgeYears = Age / c('Y' = 1,'M' = 12,'D' = 365)[Unit])
ID Age Unit AgeYears
1 1 33 Y 33.00000000
2 2 32 Y 32.00000000
3 3 44 Y 44.00000000
4 4 54 M 4.50000000
5 5 67 M 5.58333333
6 6 34 D 0.09315068
7 7 33 D 0.09041096
8 8 44 Y 44.00000000
9 9 77 Y 77.00000000
10 10 88 M 7.33333333
11 11 49 D 0.13424658
12 12 55 D 0.15068493
13 13 11 M 0.91666667
#baseR
Age <-c("33","32","44","54M","67M","34D","33D","44","77","88M","49 D","55D","11M")
AgeNum<- as.numeric(sub("\\s*\\D$","",Age))
Age[grepl("M$",Age)] <- AgeNum[grepl("M$",Age)]/12
Age[grepl("D$",Age)] <- AgeNum[grepl("D$",Age)]/365
Age <- as.numeric(Age)
result:
> Age
[1] 33.00000000 32.00000000 44.00000000 4.50000000 5.58333333 0.09315068 0.09041096 44.00000000
[9] 77.00000000 7.33333333 0.13424658 0.15068493 0.91666667
>
Additionally, a further solution using data.table:
> library(data.table)
> dt <- data.table(ID, Age)
> dt[, Unit := ifelse(grepl("D$", Age), "D", ifelse(grepl("M$", Age), "M", "Y"))][
, Age := as.integer(gsub("M|D", "", Age))]
> dt[, Age_in_years := ifelse(Unit == "Y", Age,
ifelse(Unit == "M", Age / 12, Age / 365.25))][]
ID Age Unit Age_in_years
1: 1 33 Y 33.00000000
2: 2 32 Y 32.00000000
3: 3 44 Y 44.00000000
4: 4 54 M 4.50000000
5: 5 67 M 5.58333333
6: 6 34 D 0.09308693
7: 7 33 D 0.09034908
8: 8 44 Y 44.00000000
9: 9 77 Y 77.00000000
10: 10 88 M 7.33333333
11: 11 49 D 0.13415469
12: 12 55 D 0.15058179
13: 13 11 M 0.91666667
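The answers above differ only in whether a year is taken as 365 or 365.25 days. A minimal base R sketch that wraps the same logic in one function and makes that choice an explicit parameter; age_to_years and days_per_year are hypothetical names, not from any package:

```r
# Hypothetical helper: convert strings such as "33", "54M", "49 D" to years
age_to_years <- function(age, days_per_year = 365.25) {
  value <- as.numeric(gsub("[^0-9.]", "", age))  # keep only the numeric part
  unit <- ifelse(grepl("M", age), "M",
                 ifelse(grepl("D", age), "D", "Y"))  # no letter means years
  divisor <- c(Y = 1, M = 12, D = days_per_year)[unit]
  unname(value / divisor)
}

years <- age_to_years(c("33", "54M", "49 D"))
years
```

Passing days_per_year = 365 reproduces the figures from the answers that divide by 365.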

How to properly combine columns into one column using R

I have 3 sets of data. Each one is a column of variables:
A B C
81 35 31
62 34 33
46 36 31
45 31 33
81 35 31
62 34 33
46 36 31
45 31 33
81 35 31
62 34 33
46 36 31
45 31 33
I have been trying to use rbind to combine these three data sets into one dataset with one column.
Combine<-rbind(A,B,C)
Instead I get something like this, where not only do I end up with a series of shorter columns, but the numbers all change. How do I stop this from happening?
V1 V2 V3 V4
14 9 9 5
19 15 14 5
# example data frames
dt1 = data.frame(A = 1:5)
dt2 = data.frame(B = 3:10)
dt3 = data.frame(C = 5:7)
# change to a common column name
names(dt1) = "x"
names(dt2) = "x"
names(dt3) = "x"
# bind rows
rbind(dt1, dt2, dt3)
# x
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 3
# 7 4
# 8 5
# 9 6
# 10 7
# 11 8
# 12 9
# 13 10
# 14 5
# 15 6
# 16 7
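If A, B and C are already columns of one data frame, renaming is not needed. A small sketch, assuming the first four rows of the question's table are stored in a data frame called dat (a made-up name), stacks the columns end to end with unlist, or with stack when the source column of each value should be kept:

```r
# First four rows of the question's table, assumed to be numeric columns
dat <- data.frame(A = c(81, 62, 46, 45),
                  B = c(35, 34, 36, 31),
                  C = c(31, 33, 31, 33))

# One column holding all of A, then all of B, then all of C
combined <- data.frame(x = unlist(dat, use.names = FALSE))

# Same values plus an indicator column naming the source column
long <- stack(dat)

combined
long
```

The question does not show how the three sets were created, so this is only a sketch for the numeric case.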

R data.frame add a column depending on row-values

In R, I have a data.frame that looks like this:
X Y
20 7
25 84
15 62
22 12
60 24
40 10
60 60
12 50
11 17
Now, I want a new column, let's call it "SumX", that adds each value of X to the preceding value of X, and a "SumY" column that does the same for Y. So the resulting data.frame would look like this:
X Y SumX SumY
20 7 20 #first row = X 7 #first row = Y
25 84 45 #X0 + X1 91 #Y0 + Y1
15 62 40 #X1 + X2 146 #Y1 + Y2
22 12 37 #X2 + X3 74 #Y2 + Y3
60 24 82 #X3 + X4 36 #Y3 + Y4
40 10 100 #X4 + X5 34 #Y4 + Y5
60 60 100 #and so on 70 #and so on
12 50 72 110
11 17 23 67
I can do simple X + Y into a new column with
myFrame$SumXY <- with(myFrame, X+Y)
but is there a simple way to add two X values (n + (n-1)) into SumX, and two Y values (n + (n-1)) into SumY? Even if it is with a while-loop, though I would prefer a simpler way (it's a lot of data like this). Any help is much appreciated! (I'm still pretty new to R)
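The rollapply function from the zoo package will work here.
The following code block creates the rolling sum of each 2 adjacent values. Right-aligning the window and setting partial = TRUE keeps the result the same length as the column, with the first row holding just X[1].
library(zoo)
myFrame$SumX <- rollapplyr(myFrame$X, 2, sum, partial = TRUE) # rolling sum of every 2 values
You could add by = 2 as an argument to rollapply to get non-overlapping sums instead (i.e. it sums values 1+2, then 3+4, then 5+6 etc.); the result then has fewer elements than the data frame has rows.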
Look up ?rollapply for more info.
Here's a dplyr approach.
Use mutate() to add a new column and var + lag(var, default = 0) to compute your variable. Example:
library(dplyr)
d <- data.frame(
x = 1:10,
y = 11:20,
z = 21:30
)
mutate(d, sumx = x + lag(x, default = 0))
#> x y z sumx
#> 1 1 11 21 1
#> 2 2 12 22 3
#> 3 3 13 23 5
#> 4 4 14 24 7
#> 5 5 15 25 9
#> 6 6 16 26 11
#> 7 7 17 27 13
#> 8 8 18 28 15
#> 9 9 19 29 17
#> 10 10 20 30 19
More variables can be handled similarly:
mutate(d, sumx = x + lag(x, default = 0), sumy = y + lag(y, default = 0))
#> x y z sumx sumy
#> 1 1 11 21 1 11
#> 2 2 12 22 3 23
#> 3 3 13 23 5 25
#> 4 4 14 24 7 27
#> 5 5 15 25 9 29
#> 6 6 16 26 11 31
#> 7 7 17 27 13 33
#> 8 8 18 28 15 35
#> 9 9 19 29 17 37
#> 10 10 20 30 19 39
If you know that you want to do this for many, or even EVERY column in your data frame, then here's a standard evaluation approach with mutate_() that uses a custom function I adapted from this blog post (note you need to have the lazyeval package installed). The function gets applied to each column in a for loop (which could probably be optimised).
f <- function(df, col, new_col_name) {
mutate_call <- lazyeval::interp(~ x + lag(x, default = 0), x = as.name(col))
df %>% mutate_(.dots = setNames(list(mutate_call), new_col_name))
}
for (var in names(d)) {
d <- f(d, var, paste0('sum', var))
}
d
#> x y z sumx sumy sumz
#> 1 1 11 21 1 11 21
#> 2 2 12 22 3 23 43
#> 3 3 13 23 5 25 45
#> 4 4 14 24 7 27 47
#> 5 5 15 25 9 29 49
#> 6 6 16 26 11 31 51
#> 7 7 17 27 13 33 53
#> 8 8 18 28 15 35 55
#> 9 9 19 29 17 37 57
#> 10 10 20 30 19 39 59
Just to continue the tidyverse theme, here's a solution using the purrr package (again, works for all columns, but you can subset columns if needed). It starts from the original three-column d:
library(purrr)
# Create new columns in new data frame.
# Subset `d` here if only want select columns.
# dplyr::lag() is called explicitly: stats::lag() does not shift a plain vector
sum_d <- map_df(d, ~ . + dplyr::lag(., default = 0))
# Set names correctly and
# bind back to original data
names(sum_d) <- paste0("sum", names(sum_d))
d <- cbind(d, sum_d)
d
#> x y z sumx sumy sumz
#> 1 1 11 21 1 11 21
#> 2 2 12 22 3 23 43
#> 3 3 13 23 5 25 45
#> 4 4 14 24 7 27 47
#> 5 5 15 25 9 29 49
#> 6 6 16 26 11 31 51
#> 7 7 17 27 13 33 53
#> 8 8 18 28 15 35 55
#> 9 9 19 29 17 37 57
#> 10 10 20 30 19 39 59
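You can use the lag function from dplyr to achieve something like this:
# dplyr::lag() shifts the vector by one; stats::lag() would leave a plain vector unchanged
X <- myFrame$X
myFrame$SumX[1] <- X[1]
myFrame$SumX[2:nrow(myFrame)] <- X[2:nrow(myFrame)] + dplyr::lag(X)[2:nrow(myFrame)]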
Another base R option uses cumsum (the data frame is called df here):
#SumX
cumsum(df$X) - c(0, 0, cumsum(df$X)[1:(nrow(df)-2)])
#[1] 20 45 40 37 82 100 100 72 23
#SumY
cumsum(df$Y) - c(0, 0, cumsum(df$Y)[1:(nrow(df)-2)])
#[1] 7 91 146 74 36 34 70 110 67
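The same result can also be had without any package by adding each column to a copy of itself shifted down one row. A minimal sketch using the question's data; prev is a hypothetical helper name:

```r
myFrame <- data.frame(
  X = c(20, 25, 15, 22, 60, 40, 60, 12, 11),
  Y = c(7, 84, 62, 12, 24, 10, 60, 50, 17)
)

# Hypothetical helper: the previous value of v, with 0 before the first element
prev <- function(v) c(0, head(v, -1))

myFrame$SumX <- myFrame$X + prev(myFrame$X)
myFrame$SumY <- myFrame$Y + prev(myFrame$Y)
myFrame
```

Using 0 as the filler makes the first row equal to the first value itself, which matches the desired output in the question.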
