Sum of every n-th column of a data frame - r

Let's assume the data,
a <- c(10, 20, 30, 40, 50)
b <- c(100, 200, 300, 400, 500)
c <- c(1, 2, 3, 4, 5)
d <- c(5, 4, 3, 2, 1)
df <- data.frame(a, b, c, d)
df
a b c d
1 10 100 1 5
2 20 200 2 4
3 30 300 3 3
4 40 400 4 2
5 50 500 5 1
I want to sum alternate columns, i.e. a+c and b+d and so on. The solution should be applicable, or easily modified, for other cases such as summing every second column, i.e. a+c, b+d, c+e etc. For the example above, the result should look like this,
> dfsum
aplusc bplusd
1 11 105
2 22 204
3 33 303
4 44 402
5 55 501
Is there any easy way to do that? I have figured out how to do the alternating sum, e.g. df[, c(T, F)] + df[, c(F, T)], but how do I sum every n-th column? Besides base R, is there any tidy solution for this problem?

Here is a more generic approach which, however, assumes that the number of columns in your data frame is an even number, i.e.
n = 2
Reduce(`+`, split.default(df, rep(seq(ncol(df) / n), each = ncol(df) / n)))
# a b
#1 11 105
#2 22 204
#3 33 303
#4 44 402
#5 55 501
The above basically splits the data frame every 2 columns, i.e. a and b, then c and d. Using Reduce, all first elements are added together, then all second elements, and so on. So for your case, a is added to c, and b to d. If you want to take the sum every 3 columns, just change the denominator in the split.default call above to 3. However, note that the number of columns must be divisible by 3 (or by whatever n you choose).
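To see exactly what Reduce receives, you can inspect the split on its own (shown here for the 4-column example with n = 2):
split.default(df, rep(seq(ncol(df) / n), each = ncol(df) / n))
# $`1`
#    a   b
# 1 10 100
# 2 20 200
# 3 30 300
# 4 40 400
# 5 50 500
#
# $`2`
#   c d
# 1 1 5
# 2 2 4
# 3 3 3
# 4 4 2
# 5 5 1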

One approach is to use mutate:
library(tidyverse)
df %>%
  mutate(aplusc = a + c,
         bplusd = b + d) %>%
  select(aplusc, bplusd)
#aplusc bplusd
#1 11 105
#2 22 204
#3 33 303
#4 44 402
#5 55 501
Edit
Here's an approach based on @Sotos's answer, so it can work on a larger dataset:
Reduce(`+`, split.default(df, (seq_along(df) - 1) %/% 2))
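If a tidyverse flavour is preferred (as asked at the end of the question), the same grouping idea can be expressed with purrr::reduce; a minimal sketch, assuming the number of columns is a multiple of the block size k (column i is summed with columns i + k, i + 2k, ...):
library(tidyverse)

k <- 2  # block size
dfsum <- df %>%
  split.default((seq_along(df) - 1) %/% k) %>%  # consecutive blocks of k columns
  reduce(`+`)
dfsum
#    a   b
# 1 11 105
# 2 22 204
# 3 33 303
# 4 44 402
# 5 55 501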

Related

Re-bin a data frame in R

I have a data frame which holds activity (A) data across time (T) for a number of subjects (S) in different groups (G). The activity data were sampled every 10 minutes. What I would like to do is to re-bin the data into, say, 30-minute bins (either adding or averaging values) keeping the subject Id and group information.
Example. I have something like this:
S G T A
1 A 30 25
1 A 40 20
1 A 50 15
1 A 60 20
1 A 70 5
1 A 80 20
2 B 30 10
2 B 40 10
2 B 50 10
2 B 60 20
2 B 70 20
2 B 80 20
And I'd like something like this:
S G T A
1 A 40 20
1 A 70 15
2 B 40 10
2 B 70 20
Whether time is the average time (as in the example) or the first/last time point and whether the activity is averaged (again, as in the example) or summed is not important for now.
I will appreciate any help you can provide on this. I was thinking about writing a Python script to re-bin this particular data frame, but I thought there might be a way of doing it in R that could be applied to any data frame with differing numbers of columns, etc.
There are several ways to arrive at the desired data frame.
I have reproduced your dataframe:
df <- data.frame(S = c(rep(1, 6), rep(2, 6)),
                 G = c(rep("A", 6), rep("B", 6)),
                 T = rep(seq(30, 80, 10), 2),
                 A = c(25, 20, 15, 20, 5, 20, 10, 10, 10, 20, 20, 20))
The classical way could be:
df[df$T == 40 | df$T == 70,]
The more modern tidyverse way is
library(tidyverse)
df %>% filter(T == 40 | T ==70)
If you want the average of A for each group G, filtered for T == 40 and T == 70:
df %>% filter(T == 40 | T == 70) %>%
  group_by(G) %>%
  mutate(A = mean(A))
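The filtering above only picks out the two time points shown in the example; if the goal is to re-bin every 30 minutes in general, here is a hedged dplyr sketch (assuming the S, G, T, A columns from the question, a shared time grid across subjects, and dplyr >= 1.0 for the .groups argument):
library(dplyr)

df %>%
  mutate(bin = (T - min(T)) %/% 30) %>%             # assign each 10-minute sample to a 30-minute bin
  group_by(S, G, bin) %>%
  summarise(T = mean(T), A = mean(A), .groups = "drop") %>%
  select(S, G, T, A)
With the example data this reproduces the desired output: bins at T = 40 and T = 70 with A = 20 and 15 for subject 1, and A = 10 and 20 for subject 2.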

Combine data.frames of different dimensions creating duplicates where needed /r dplyr

I am looking for a way to combine two tables of different dimensions by ID, but the final table should have some duplicated values depending on each table.
Here is a random example:
IDx = c("a", "b", "c", "d")
sex = c("M", "F", "M", "F")
IDy = c("a", "a", "b", "c", "d", "d")
status = c("single", "children", "single", "children", "single", "children")
salary = c(30, 80, 50, 40, 30, 80)
x = data.frame(IDx, sex)
y = data.frame(IDy, status, salary)
Here is x:
IDx sex
1 a M
2 b F
3 c M
4 d F
Here is y:
IDy status salary
1 a single 30
2 a children 80
3 b single 50
4 c children 40
5 d single 30
6 d children 80
I am looking for this:
IDy sex status salary
1 a M single 30
2 a M children 80
3 b F single 50
4 c M children 40
5 d F single 30
6 d F children 80
Basically, sex should be matched to fit the needs of table y. All values in both tables should be used; the actual table is a lot larger. Not all IDs will need to be duplicated.
This should be fairly simple, but I cannot find a good answer anywhere online.
Note, I don't want NAs to be introduced.
I am new to R, and since I have been focusing on dplyr it would help if the example came from there. It might be simple with base R, too.
UPDATE
The bolded sentences above might be confusing with respect to the final answer. Sorry, it has been a confusing case, which I realised should include one extra column that complicates things, but more on that later.
First, I tried to see what is happening with my actual table and to find which suggested answer fits my needs. I removed any problematic columns for the following result. So, I checked this:
dim(x)
> [1] 231 2
dim(y)
> [1] 199 8
# left_join joins matching rows from y to x
suchait <- left_join(x, y, by= c("IDx" = "IDy"))
# inner_join retains only rows in both sets
jdobres <- inner_join(y, anno2, by = c(IDx = "IDy"))
dim(suchait) # actual table used
> [1] 225 9
dim(jdobres)
> [1] 219 9
But why/where do they differ?
This shows the 6 rows that appear in suchait's table but not in jdobres's, which is due to the different join approach:
setdiff(suchait, jdobres )
Using dplyr:
library(dplyr)
df <- left_join(x, y, by = c("IDx" = "IDy"))
Your result would be:
IDx sex status salary
1 a M single 30
2 a M children 80
3 b F single 50
4 c M children 40
5 d F single 30
6 d F children 80
Or you could do:
df <- left_join(y, x, by = c("IDy" = "IDx"))
It would give:
IDy status salary sex
1 a single 30 M
2 a children 80 M
3 b single 50 F
4 c children 40 M
5 d single 30 F
6 d children 80 F
You can also reorder your columns to get it exactly the way you wanted:
df <- df[, c("IDy", "sex", "status", "salary")]
result:
IDy sex status salary
1 a M single 30
2 a M children 80
3 b F single 50
4 c M children 40
5 d F single 30
6 d F children 80
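Since the question leans toward dplyr, the join and the reordering can also be written in one pipeline with select; a small equivalent sketch (dplyr is already loaded above):
df <- left_join(y, x, by = c("IDy" = "IDx")) %>%
  select(IDy, sex, status, salary)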

Fastest way to find nearest value in vector

I have two integer/posixct vectors:
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements
Now my resulting vector c should contain, for each element of vector a, the nearest element of b:
c <- c(4,4,4,4,4,6,6,...)
I tried it with apply and which.min(abs(a - b)) but it's very very slow.
Is there any more clever way to solve this? Is there a data.table solution?
As presented in this link, you can do either:
which(abs(x - your.number) == min(abs(x - your.number)))
or
which.min(abs(x - your.number))
where x is your vector and your.number is the value you are looking for. If you have a matrix or data.frame, simply convert it to a numeric vector in an appropriate way and then try this on the resulting numeric vector.
For example:
x <- 1:100
your.number <- 21.5
which(abs(x - your.number) == min(abs(x - your.number)))
would output:
[1] 21 22
Update: Based on the very kind comment of hendy I have added the following to make it clearer:
Note that the answers above (i.e. 21 and 22) are the indexes of the items (this is how which() works in R), so if you want the actual values, you have to use these indexes to get them. Let's have another example:
x <- seq(from = 100, to = 10, by = -5)
x
[1] 100 95 90 85 80 75 70 65 60 55 50 45 40 35 30 25 20 15 10
Now let's find the number closest to 42:
your.number <- 42
target.index <- which(abs(x - your.number) == min(abs(x - your.number)))
x[target.index]
which would output the "value" we are looking for from the x vector:
[1] 40
Not quite sure how it will behave with your volume but cut is quite fast.
The idea is to cut your vector a at the midpoints between the elements of b.
Note that I am assuming the elements in b are strictly increasing!
Something like this:
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements
cuts <- c(-Inf, b[-1]-diff(b)/2, Inf)
# Will yield: c(-Inf, 5, 8, 13, Inf)
cut(a, breaks=cuts, labels=b)
# [1] 4 4 4 4 4 6 6 6 10 10 10 10 10 16 16
# Levels: 4 6 10 16
This is even faster using a lower-level function like findInterval (which, again, assumes that breakpoints are non-decreasing).
findInterval(a, cuts)
[1] 1 1 1 1 2 2 2 3 3 3 3 3 4 4 4
So of course you can do something like:
index = findInterval(a, cuts)
b[index]
# [1] 4 4 4 4 6 6 6 10 10 10 10 10 16 16 16
Note that you can choose what happens to elements of a that are equidistant from two elements of b by passing the relevant arguments to cut (or findInterval); see their help pages.
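For example, with the a, b and cuts from above, cut assigns the ties (a = 5, 8, 13, which are exactly halfway between two elements of b) to the lower value because its intervals are right-closed by default; right = FALSE flips that:
cut(a, breaks=cuts, labels=b, right=FALSE)
# [1] 4 4 4 4 6 6 6 10 10 10 10 10 16 16 16
# Levels: 4 6 10 16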
library(data.table)
a=data.table(Value=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
a[,merge:=Value]
b=data.table(Value=c(4,6,10,16))
b[,merge:=Value]
setkeyv(a,c('merge'))
setkeyv(b,c('merge'))
Merge_a_b=a[b,roll='nearest']
In data.table, when we join two data.tables there is an option roll = 'nearest', which matches every row of the data.table inside the brackets (here b) to the row of the other table (here a) with the nearest key value. The resulting data.table has as many rows as the table inside the brackets, i.e. b here. It requires a common key for merging, as usual.
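Note that this gives one row per element of b. For the direction asked in the question (the nearest b value for every element of a), you would put a inside the brackets instead; a hedged sketch using the same keyed tables:
b[a, roll='nearest']
# one row per element of a; the Value column coming from b holds the nearest b value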
For those who would be satisfied with the slow solution:
sapply(a, function(a, b) {b[which.min(abs(a-b))]}, b)
Here might be a simple base R option, using max.col + outer:
b[max.col(-abs(outer(a,b,"-")))]
which gives
> b[max.col(-abs(outer(a,b,"-")))]
[1] 4 4 4 4 6 6 6 10 10 10 10 10 16 16 16
Late to the party, but there is now a function in the DescTools package called Closest which does almost exactly what you want (it just doesn't handle multiple values at once).
To get around this, we can lapply over your vector a and find the closest value for each element.
library(DescTools)
lapply(a, function(i) Closest(x = b, a = i))
You might notice that more values can be returned than there are elements in a. This is because Closest will return both values if the value you are testing is exactly halfway between two (e.g. 3 is exactly halfway between 1 and 5, so both 1 and 5 would be returned).
To get around this, put either min or max around the result:
lapply(a, function(i) min(Closest(x = b, a = i)))
lapply(a, function(i) max(Closest(x = b, a = i)))
Then unlist the result to get a plain vector :)
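Putting it together, a small sketch (ties such as a = 5 go to the smaller neighbour because of min()):
library(DescTools)

nearest <- unlist(lapply(a, function(i) min(Closest(x = b, a = i))))
nearest  # a plain numeric vector, one nearest b value per element of a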

R code for repeating value into column

I am basically new to using R software.
I have a list of repeating codes (numeric/categorical) from an Excel file. I need to add another column of values (even random ones) such that every identical code gets the same value.
Codes Value
1 122
1 122
2 155
2 155
2 155
4 101
4 101
5 251
5 251
Thank you.
We can use match:
n <- length(code0 <- unique(code))
value <- sample(4 * n, n)[match(code, code0)]
or factor:
n <- length(unique(code))
value <- sample(4 * n, n)[factor(code)]
The random integers generated are between 1 and 4 * n. The number 4 is arbitrary; you can also put 100.
Example
set.seed(0); code <- rep(1:5, sample(5))
code
# [1] 1 1 1 1 1 2 2 3 3 3 3 4 4 4 5
n <- length(code0 <- unique(code))
sample(4 * n, n)[match(code, code0)]
# [1] 5 5 5 5 5 18 18 19 19 19 19 12 12 12 11
Comment
The above gives the most general treatment, assuming that code is neither necessarily sorted nor takes consecutive values.
If code is sorted (no matter what value it takes), we can also use rle:
if (!is.unsorted(code)) {
  n <- length(k <- rle(code)$lengths)
  value <- rep.int(sample(4 * n, n), k)
}
If code takes consecutive values 1, 2, ..., n (but not necessarily sorted), we can skip match or factor and do:
n <- max(code)
value <- sample(4 * n, n)[code]
A further note: if code is not numerical but categorical, the match and factor methods will still work.
What you could also do is the following; it is perhaps more intuitive for a beginner:
data <- data.frame('a' = c(122,122,155,155,155,101,101,251,251))
duplicates <- unique(data)
duplicates[, 'b'] <- rnorm(nrow(duplicates))
data <- merge(data, duplicates, by='a')
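If you prefer a dplyr flavour, the same idea can be sketched with group_by and mutate on the data frame from the previous example; rnorm(1) is evaluated once per group, so every identical code gets the same value:
library(dplyr)

data %>%
  group_by(a) %>%
  mutate(b = rnorm(1)) %>%  # one random draw per distinct code, recycled within the group
  ungroup()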

How to make a code which finds the largest k cells and their locations, when given a table?

I want code which finds the largest k cells and their locations, given a two-dimensional table.
For example, the given two-dimensional table is as follows:
table_ex
A B C
F 99 693 515
I 722 583 37
M 186 817 525
The desired function, called as below, would give this result:
function(table_ex, 2)
817, M B
722, I A
In the case described above, since k = 2, the function gives the two largest cells and their locations.
You can coerce to data.frame then just sort using order:
getTopCells <- function(tab, n) {
  sort_df <- as.data.frame(tab)
  sort_df <- sort_df[order(-sort_df$Freq), ]
  sort_df[1:n, ]
}
Example:
tab <- table(sample(c('A', 'B'), 200, replace=T),
             rep(letters[1:5], 40))
# returns:
# a b c d e
# A 20 23 19 21 23
# B 20 17 21 19 17
getTopCells(tab, 3)
# returns:
# Var1 Var2 Freq
# 3 A b 23
# 9 A e 23
# 6 B c 21
A solution using only base R, without coercing into a data.frame:
First let's create a table:
set.seed(123)
tab <- table(sample(c('A', 'B'), 200, replace=T),
             rep(letters[1:5], 40))
a b c d e
A 15 13 18 20 22
B 25 27 22 20 18
and now:
for (i in 1:nrow(tab)) {
  cat(dimnames(tab)[[1]][i], which.max(tab[i, ]), max(tab[i, ]), '\n')
}
A 5 22
B 2 27
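Note that the loop above reports the maximum of each row rather than the overall top k. A hedged base-only sketch that returns the k largest cells together with their row and column labels, demonstrated on the tab printed above (the helper name top_k_cells is just for illustration):
top_k_cells <- function(tab, k) {
  idx <- order(tab, decreasing = TRUE)[seq_len(k)]  # linear indices of the k largest cells
  pos <- arrayInd(idx, dim(tab))                    # row/column indices for each linear index
  data.frame(value = as.vector(tab[idx]),
             row   = dimnames(tab)[[1]][pos[, 1]],
             col   = dimnames(tab)[[2]][pos[, 2]])
}

top_k_cells(tab, 2)
#   value row col
# 1    27   B   b
# 2    25   B   a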
I'm using a reshaping approach here. The key is to save your table in a data.frame format and then save your row names as another column in that data.frame. Then you can use something like:
df = read.table(text="
names A B C
F 99 693 515
I 722 583 37
M 186 817 525", header=T)
library(tidyr) # to reshape your dataset
library(dplyr) # to join commands
df %>%
  gather(names2, value, -names) %>%  # reshape your dataset
  arrange(desc(value)) %>%           # arrange your value column
  slice(1:2)                         # pick top 2 rows
# names names2 value
# 1 M B 817
# 2 I A 722
PS: In case you don't want to use any packages, or don't want to use data.frames but your original table, I'm sure you'll find some great alternative replies here.
