Add data from a data table to another using values of a column - r

I know the question is confusing, but I hope the example will make it clear.
I have two tables:
x y
1 23
2 34
3 76
4 31
&
x y
1 78
3 51
5 54
I need to add the y columns based on the x values. I can do it using loops, but I don't want to. A solution using base R, dplyr, or data.table functions would be best, as I am most familiar with those; the apply family of functions is fine as well. The output should look like this:
x y
1 101
2 34
3 127
4 31
5 54

The basic idea is to combine the two datasets, group by x, and summarize y with sum. There are a couple of ways to do it:
data.table:
rbind(dtt1, dtt2)[, .(y = sum(y)), by = x]
# x y
# 1: 1 101
# 2: 2 34
# 3: 3 127
# 4: 4 31
# 5: 5 54
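rbindlist() is the data.table-native way to stack tables; it accepts a list and is typically faster when combining many tables. A sketch of the same answer with it:
rbindlist(list(dtt1, dtt2))[, .(y = sum(y)), by = x]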
base R aggregate:
aggregate(y ~ x, rbind(dtt1, dtt2), FUN = sum)
dplyr:
rbind(dtt1, dtt2) %>% group_by(x) %>% summarize(y = sum(y))
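If you prefer to stay entirely within dplyr, bind_rows() is a drop-in alternative to rbind() here (a sketch using the same dtt1 and dtt2 defined below):
library(dplyr)
bind_rows(dtt1, dtt2) %>% group_by(x) %>% summarize(y = sum(y))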
The data:
library(data.table)
dtt1 <- fread('x y
1 23
2 34
3 76
4 31')
dtt2 <- fread('x y
1 78
3 51
5 54')

Populate column by adding onto row above using lag() in R

I want to populate an existing column with values that continually add onto the row above.
This is easy in Excel, but I haven't figured out a good way to automate it in R.
If we had two columns in Excel, A and B, we would want cell B2 to be =B1+A2, cell B3 to be =B2+A3, and so on. How can I do this in R?
#example dataframe
df <- data.frame(A = 0:9, B = c(50,0,0,0,0,0,0,0,0,0))
#desired output
desired <- data.frame(A = 0:9, B = c(NA, 51, 53, 56, 60, 65, 71, 78, 86, 95))
I tried using the lag() function, but it didn't give the correct output: lag() only looks one row back at the original B, while each new B depends on the value just computed above it (the recurrence B[i] = B[i-1] + A[i]), which mutate() cannot express in one pass.
library(dplyr)
df <- df %>%
  mutate(B = B + lag(A))
So I made a for loop that works, but I feel like there's a better solution.
for (i in 2:nrow(df)) {
  df$B[i] <- df$B[i-1] + df$A[i]
}
Eventually, I want to iterate this over every n rows of the whole dataframe, so that the summation resets every n rows. (Any tips on how to do that would be greatly appreciated!)
This might be close to what you need, and uses tidyverse. Specifically, it uses accumulate from purrr.
Say you want to reset to zero every n rows; you can use group_by ahead of time to handle that.
It was not entirely clear how you'd like to handle the first row; here, it will just use the first B value and ignore the first A value, which looked similar to what you had in the post.
n <- 5
library(tidyverse)
df %>%
  group_by(grp = ceiling(row_number() / n)) %>%
  mutate(B = accumulate(A[-1], sum, .init = B[1]))
Output
A B grp
<int> <dbl> <dbl>
1 0 50 1
2 1 51 1
3 2 53 1
4 3 56 1
5 4 60 1
6 5 0 2
7 6 6 2
8 7 13 2
9 8 21 2
10 9 30 2
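If you don't want to keep the helper column, the same pipeline can drop it at the end (a small follow-up sketch):
df %>%
  group_by(grp = ceiling(row_number() / n)) %>%
  mutate(B = accumulate(A[-1], sum, .init = B[1])) %>%
  ungroup() %>%
  select(-grp)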
cumsum() can be used to get the result you need; it works here because B is zero everywhere after the first row, so B + A is exactly the sequence of increments.
df$B <- cumsum(df$B + df$A)
df
A B
1 0 50
2 1 51
3 2 53
4 3 56
5 4 60
6 5 65
7 6 71
8 7 78
9 8 86
10 9 95
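For the reset-every-n part of the question in base R, ave() can apply cumsum() within blocks of rows (a sketch, assuming the original df from above, before the line that overwrote B, and a block size n; note that, unlike the accumulate() answer, this convention includes the first A of each block):
n <- 5
grp <- ceiling(seq_len(nrow(df)) / n)
df$B <- ave(df$B + df$A, grp, FUN = cumsum)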

How to replace NAs with values from another column in data.table (Example given)?

DT is a data.table and I want to replace the NAs in count with values from the visits column; Expected_DT is the desired result.
DT<-data.table(name=c("x","x","x","x"),hour=1:4,count=c(NA,45,56,78),visits=c(14,45,56,78))
name hour count visits
1: x 1 NA 14
2: x 2 45 45
3: x 3 56 56
4: x 4 78 78
This is what I want
Expected_DT<-data.table(name=c("x","x","x","x"),hour=1:4,count=c(14,45,56,78),visits=c(14,45,56,78))
name hour count visits
1: x 1 14 14
2: x 2 45 45
3: x 3 56 56
4: x 4 78 78
A few options:
1) using fcoalesce:
DT[, count := fcoalesce(count, visits)]
2) using is.na:
DT[is.na(count), count := visits]
3) using fifelse:
DT[, count := fifelse(is.na(count), visits, count)]
4) using set, with [[ for faster indexing (per sindri_baldur's comment):
ix <- DT[is.na(count), which=TRUE]
set(DT, ix, "count", DT[["visits"]][ix])
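All four options update count by reference, so DT does not need to be reassigned. A quick sanity check afterwards (a sketch):
DT[is.na(count), .N] # should be 0 after the update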
Solution using data.table:
DT[is.na(count), count := visits]
DT
Returns:
name hour count visits
1: x 1 14 14
2: x 2 45 45
3: x 3 56 56
4: x 4 78 78
Some base R solutions:
using ifelse:
DT <- within(DT, count <- ifelse(is.na(count), visits, count))
using rowSums:
DT <- within(DT, count <- rowSums(cbind(is.na(count) * visits, count), na.rm = TRUE))
And here is a dplyr version, to be complete for other users:
library(dplyr)
DT %>%
  mutate(count = if_else(is.na(count), visits, count))
name hour count visits
1 x 1 14 14
2 x 2 45 45
3 x 3 56 56
4 x 4 78 78
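dplyr also has coalesce(), which mirrors data.table's fcoalesce and reads a little more directly (a sketch):
DT %>%
  mutate(count = coalesce(count, visits))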

R: bind columns after lapply() the poly() function

I want to add columns containing polynomials to a dataframe (DF).
Background: I need to use polynomials in a glmnet setting. I cannot call poly() directly in the glmnet() estimation command; I get an error, likely because my "Xtrain" data contain factors.
My workaround is to slice my Xtrain DF in two pieces, one containing all factors (for which no transformation is needed) and one containing the rest, viz. the numeric columns.
Now I want to add columns with polynomials to my numeric DF.
Here is a minimal example of my problem.
# Some data
x <- 1:10
y <- 11:20
df = as.data.frame(cbind(x,y))
# Looks like this (first rows shown)
x y
1 1 11
2 2 12
3 3 13
# Now I generate polys
lapply(df, function(i) poly(i, 2, raw=T)[,1:2])
However, I cannot figure out how to "cbind" the results. What I want in the end is a DF that contains x, x^2, y, and y^2; the order does not matter. Ideally it would also have column labels (to identify the polys). For instance like this:
x x2 y y2
1 1 1 11 121
2 2 4 12 144
3 3 9 13 169
Thank you...
Cheers!
Another option is
as.data.frame(lapply(df, function(i) poly(i, 2, raw=T)[,1:2]))
# x.1 x.2 y.1 y.2
#1 1 1 11 121
#2 2 4 12 144
#3 3 9 13 169
# ...
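If you want the exact labels from the question (x, x2, y, y2), the columns can be renamed afterwards (a sketch, relying on there being two poly columns per input column):
out <- as.data.frame(lapply(df, function(i) poly(i, 2, raw = TRUE)[, 1:2]))
names(out) <- paste0(rep(names(df), each = 2), c("", "2"))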
As mentioned by @gpier and @akrun already, you might use ^ instead of poly:
n <- 2
df[paste(names(df), n, sep = "_")] <- df^n
df
We can use do.call
do.call(cbind, lapply(df, function(i) poly(i, 2, raw=T)[,1:2]))
If we just need squares
cbind(df, as.matrix(df)^2)
poly is not the right function if you need squares. Try
cbind(df,lapply(df, function(x) x^2))
x y x y
1 1 11 1 121
2 2 12 4 144
3 3 13 9 169
4 4 14 16 196
5 5 15 25 225
6 6 16 36 256
7 7 17 49 289
8 8 18 64 324
9 9 19 81 361
10 10 20 100 400
EDIT: indeed you don't even need lapply, you could just use cbind(df, df^2)
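To keep distinct column labels with that shortcut, setNames() can rename the squared columns on the way in (a sketch):
cbind(df, setNames(df^2, paste0(names(df), "2")))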

Is it possible in R to merge and group (by a column), the dataframe with a single function or in a single step?

I am new to R programming, so I wanted to learn whether it is possible to perform merging and grouping of data with a single function, or within a single step, in R.
I'm not sure if I've understood your question correctly, but it's possible to group and summarize data via the aggregate function, which is in base R (no extra package needed):
df <- data.frame(a=1:40, b=rbinom(40, 10, 0.5), n=rnorm(40), p=rpois(40, lambda=4), group=gl(4,10), even=rep(c(1,2),20))
aggregate(b ~ group, df, sum)        # aggregate/sum over group
aggregate(b ~ group + even, df, sum) # aggregate/sum over group & even
Results:
> aggregate(b ~ group, df, sum)
group b
1 1 51
2 2 49
3 3 49
4 4 47
> aggregate(b ~ group + even, df, sum)
group even b
1 1 1 27
2 2 1 23
3 3 1 25
4 4 1 23
5 1 2 24
6 2 2 26
7 3 2 24
8 4 2 24
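If by "merging" you mean joining two tables, the merge() call can be nested inside aggregate() so both happen in a single step. A sketch, where sales and groups are hypothetical tables invented purely for illustration:
sales <- data.frame(id = 1:6, amount = c(10, 20, 30, 40, 50, 60))
groups <- data.frame(id = 1:6, group = gl(2, 3))
aggregate(amount ~ group, merge(sales, groups, by = "id"), sum) # merge and group in one step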

Assigning labels in R based on ID?

I have a data frame as follows:
DF<-data.frame(a=c(1,1,1,2,2,2,3,3,4,4),b=c(43,23,45,65,43,23,65,76,87,4))
a b
1 43
1 23
1 45
2 65
2 43
2 23
3 65
3 76
4 87
4 4
I want to set a flag like this:
a b flag
1 43 A
1 23 B
1 45 C
2 65 A
2 43 B
2 23 C
3 65 A
3 76 B
4 87 A
4 4 B
How can I get this done in R?
Using dplyr
library(dplyr)
DF %>% group_by(a) %>% mutate(flag=LETTERS[row_number()])
Using data.table (HT to @David Arenberg):
library(data.table)
setDT(DF)[, flag := LETTERS[1:.N], a]
And a soon-to-be-vintage solution (by @Roman Luštrik), which returns the flag vector ready to assign:
DF$flag <- do.call("c", sapply(rle(DF$a)$lengths, FUN = function(x) LETTERS[1:x]))
Addendum
@akrun suggested the following extension of LETTERS, addressing the follow-up question "What if there are more than 26 groups?" (asked by @James):
Let <- unlist(sapply(1:3, function(i) do.call(paste0, expand.grid(rep(list(LETTERS), i)))))
All of the above code remains fully functional when LETTERS is replaced by Let.
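With one to three letters the pool holds 26 + 26^2 + 26^3 = 18278 unique labels; for example (a sketch reusing the dplyr approach above):
DF %>% group_by(a) %>% mutate(flag = Let[row_number()])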
I'll throw in one more in base R:
transform(DF, flag = LETTERS[ave(a,a,FUN=seq_along)])
