Internal joins in R - r

How do I get the following result in R for a large data set?
data <- data.frame(col1 = c("A","A","B","B","B","","","","","",""),
col2 = c(1, 2, 3, 4, 2, 5, 4, 7, 1, 2, 3),
col3 = c(5, 6, 7, 10, 15, 15, 10, 20, 30, 40, 50))
where col3 is the sale number.
Desired output:
For each col2 value, select the rows that are not assigned to the same col1 group. For example, col2 values 1 and 2 are assigned to A in col1, so I want the col3 sale numbers for those col2 values, excluding the ones that already belong to A (i.e. 5 and 6 in col3).
Similarly, col2 values 3, 4 and 2 are assigned to B in col1, so I want the col3 sale numbers excluding those that belong to B (i.e. 7, 10 and 15 in col3).
Expected result:
col1 col2 col3(SUM OF SALE)
A 1 30
A 2 40
B 3 50
B 4 10
B 2 6
B 2 40

Related

Strange behaviour when selecting columns in data.table in r: Only works when string is given directly, not as a variable

I want to select some columns in a data.frame/data.table. However there seems to be a strange behaviour:
Create dummy data:
df=data.frame(col1=c(1,2),col2=c(11,22),col3=c(111,222))
So our data.frame looks like
col1 col2 col3
1 1 11 111
2 2 22 222
Now I define some variables for the column names:
col1='col1'
col2='col2'
So both df[,c(col1,col2)] and df[,c('col1','col2')] result in
col1 col2
1 1 11
2 2 22
as one would expect.
However if I do the same on the data.table (created by df=data.table(df))
col1 col2 col3
1: 1 11 111
2: 2 22 222
something strange happens. df[,c('col1','col2')] still gets the correct result:
col1 col2
1: 1 11
2: 2 22
but df[,c(col1,col2)] does not work anymore:
[1] 1 2 11 22
Why is that?
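This is not strange behavior; it is described in the documentation. In j, col1 and col2 are resolved as columns of the data.table, so c(col1, col2) concatenates the column values. To select columns by the names stored in variables, use with = FALSE: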
df[, c(col1, col2), with = FALSE]
-output
col1 col2
1: 1 11
2: 2 22
According to ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables. This allows to not just select columns in j, but also compute on them e.g., x[, a] and x[, sum(a)] returns x$a and sum(x$a) as a vector respectively. x[, .(a, b)] and x[, .(sa=sum(a), sb=sum(b))] returns a two column data.table each, the first simply selecting columns a, b and the second computing their sums.
Other options are
df[, .(col1, col2)]
col1 col2
1: 1 11
2: 2 22
df[, .SD, .SDcols = c(col1, col2)]
col1 col2
1: 1 11
2: 2 22
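The selection forms above can be compared side by side. This is a minimal sketch, assuming the data.table package is installed; the variable name cols is hypothetical and is chosen so that it does not clash with any column name, which is what caused the surprise in the question.

```r
library(data.table)

dt <- data.table(col1 = c(1, 2), col2 = c(11, 22), col3 = c(111, 222))

# Hypothetical variable holding the column names; its name is not a column of dt
cols <- c("col1", "col2")

# Treat j as a character vector of column names rather than an expression
res_with <- dt[, cols, with = FALSE]

# The .. prefix looks up cols in the calling scope instead of in dt
res_dotdot <- dt[, ..cols]

# .SDcols restricts .SD to the named columns
res_sd <- dt[, .SD, .SDcols = cols]

# What the question ran into: col1 and col2 in j are the columns themselves,
# so c() concatenates their values into one vector
res_vec <- dt[, c(col1, col2)]
```

Keeping the names in a variable that is not itself a column name avoids the ambiguity between a column and a variable in the calling scope.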

Add each column to a new data frame

For example, I have a data frame with 4 columns:
col1 col2 col3 col4
I would like to get a new data frame by accumulating each column:
col1 col1+col2 col1+col2+col3 col1+col2+col3+col4
How should I write in R?
In base R, you can calculate row-wise cumsum using apply.
Using @Henry's data:
startdf[] <- t(apply(startdf, 1, cumsum))
startdf
# col1 col2 col3 col4
#1 1 21 321 4321
#2 4 34 234 1234
If this were a matrix, you could use rowCumsums from the matrixStats package.
Starting with a data frame and returning a data frame, you could try something like:
library(matrixStats)
startdf <- data.frame(col1=c(1,4), col2=c(20,30),
col3=c(300,200), col4=c(4000,1000))
finishdf <- as.data.frame(rowCumsums(as.matrix(startdf)))
to go from
col1 col2 col3 col4
1 1 20 300 4000
2 4 30 200 1000
to
V1 V2 V3 V4
1 1 21 321 4321
2 4 34 234 1234
Base R (not as efficient or clean as Ronak's) [using Henry's data]:
data.frame(Reduce(rbind, Map(cumsum, data.frame(t(startdf)))), row.names = NULL)
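Another base R possibility, not taken from the answers above, is to accumulate whole columns with Reduce(..., accumulate = TRUE), so that no per-row apply is needed. A minimal sketch, assuming all columns are numeric and reusing Henry's startdf:

```r
startdf <- data.frame(col1 = c(1, 4), col2 = c(20, 30),
                      col3 = c(300, 200), col4 = c(4000, 1000))

# Add the columns one after another, keeping every intermediate sum:
# col1, col1 + col2, col1 + col2 + col3, ...
acc <- Reduce(`+`, startdf, accumulate = TRUE)

# Reduce() returns an unnamed list, so restore the original column names
names(acc) <- names(startdf)
finishdf <- as.data.frame(acc)
finishdf
```

Because each step adds two full columns, the work is vectorized over rows, which may matter when there are many rows and few columns.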

How can I add a column of all 1s to my data frame?

I have a data frame and I want to add a new column in which every entry is 1. How can I do that?
for example
col1. col2
1. 2.
4. 5.
33. 4.
5. 3.
new column
col1. col2. col3
1. 2. 1
4. 5. 1
33. 4. 1
5. 3. 1
df1$col3 <- 1
This should work as well.
Similarly, the following could also work:
df1 <- data.frame(df1, col3 = 1)
The simplest option is to use [ (see ?Extract):
df1['col3'] <- 1
One of the good things about using [ instead of $ is that we can pass variable identifiers as well
v1 <- 'col3'
df1[v1] <- 1
But, if we do
df1$v1 <- 1
it creates a column named 'v1' instead of 'col3'.
Other variations without changing the initial object would be
transform(df1, col3 = 1)
cbind(df1, col3 = 1)
NOTE: All of these append the new column as the last column.
Also, there is a convenient function add_column which can add a column by specifying the position. By default, it creates the column as the last one
library(tibble)
add_column(df1, col3 = 1)
# col1. col2 col3
#1 1 2 1
#2 4 5 1
#3 33 4 1
#4 5 3 1
But, if we need to change it to a specific location, there are arguments
add_column(df1, col3 = 1, .after = 1)
# col1. col3 col2
#1 1 1 2
#2 4 1 5
#3 33 1 4
#4 5 1 3
data
df1 <- structure(list(col1. = c(1, 4, 33, 5), col2 = c(2, 5, 4, 3)),
class = "data.frame", row.names = c(NA,
-4L))

Swap rows, per two rows

Is it possible to swap two adjacent rows with each other, and then move onto the next two rows, and swap their individual rows together? i.e. swap col1 value in row 1 with col 1 value in row2; swap col 1700 value in row 87 with col 1700 value in row 88.
sample data:
col1 col2
row1 a b
row2 b b
row3 c a
row4 d c
My real data has many rows and columns and the data changes each time I go through a loop, so I need a way where I don't refer to specific column names and row names.
The desired result would look like:
col1 col2
row1 b b
row2 a b
row3 d c
row4 c a
Add 1 to the index of the first row in each pair, and subtract 1 from the index of the second row in each pair:
dat[seq_len(nrow(dat)) + c(1,-1),]
# col1 col2
#row2 b b
#row1 a b
#row4 d c
#row3 c a
This works because of the vector recycling in R:
1:10 + c(1,-1)
#[1] 2 1 4 3 6 5 8 7 10 9
Another way is to create two sequences, one of odd and one of even numbers, interleave them, and use the result as row indexes.
df[c(rbind(seq(2, nrow(df), 2), seq(1, nrow(df), 2))),]
# col1 col2
#row2 b b
#row1 a b
#row4 d c
#row3 c a
where
seq(2, nrow(df), 2)
#[1] 2 4
generates even numbered sequence and
seq(1, nrow(df), 2)
#[1] 1 3
generates odd numbered sequence.
We then use rbind and c to interleave the indices from the two vectors.
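Both index tricks above assume an even number of rows. A minimal sketch of a wrapper that swaps each complete pair and leaves a trailing unpaired row in place; the helper name swap_pairs is hypothetical and not part of the original answers:

```r
# Hypothetical helper: swap rows 1<->2, 3<->4, ...; an unpaired last row stays put
swap_pairs <- function(d) {
  n <- nrow(d)
  idx <- seq_len(n)
  even_n <- n - n %% 2              # number of rows that belong to complete pairs
  if (even_n > 0) {
    # c(1, -1) is recycled: +1 for the first row of a pair, -1 for the second
    idx[seq_len(even_n)] <- seq_len(even_n) + c(1, -1)
  }
  d[idx, , drop = FALSE]
}

dat <- data.frame(col1 = c("a", "b", "c", "d", "e"),
                  col2 = c("b", "b", "a", "c", "d"),
                  row.names = paste0("row", 1:5))
swapped <- swap_pairs(dat)
swapped
```

The function never refers to column or row names, so it can be called inside a loop on data whose shape changes between iterations.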

Select values from different columns based on a variable containing column names [duplicate]

I have a data.table like this:
col1 col2 col3 new
1 4 55 col1
2 3 44 col2
3 34 35 col2
4 44 87 col3
I want to populate another column matched_value that contains the values from the respective column names given in the new column:
col1 col2 col3 new matched_value
1 4 55 col1 1
2 3 44 col2 3
3 34 35 col2 34
4 44 87 col3 87
E.g., in the first row, the value of new is "col1" so matched_value takes the value from col1, which is 1.
How can I do this efficiently in R on a very large data.table?
An excuse to use the obscure .BY:
DT[, newval := .SD[[.BY[[1]]]], by=new]
col1 col2 col3 new newval
1: 1 4 55 col1 1
2: 2 3 44 col2 3
3: 3 34 35 col2 34
4: 4 44 87 col3 87
How it works. This splits the data into groups based on the strings in new. The value of the string for each group is stored in newname = .BY[[1]]. We use this string to select the corresponding column of .SD via .SD[[newname]]. .SD stands for Subset of Data.
Alternatives. get(.BY[[1]]) should work just as well in place of .SD[[.BY[[1]]]]. According to a benchmark run by @David, the two ways are equally fast.
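The two grouped lookups described above can be tried on the example table. This is a small sketch, assuming the data.table package is installed; the result column names via_sd and via_get are hypothetical and used only to keep the two results apart.

```r
library(data.table)

DT <- data.table(col1 = c(1, 2, 3, 4),
                 col2 = c(4, 3, 34, 44),
                 col3 = c(55, 44, 35, 87),
                 new  = c("col1", "col2", "col2", "col3"))

# Within each group, .BY[[1]] is the group's value of 'new' (a column name),
# which is used to pick that column out of .SD
DT[, via_sd := .SD[[.BY[[1]]]], by = new]

# Same idea, looking the column up by name with get()
DT[, via_get := get(.BY[[1]]), by = new]

DT
```

Both forms rely on all referenced columns having the same type, since the new column is filled group by group.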
We can match the 'new' column with the column names of the dataset to get the column index, cbind with the row index (1:nrow(df1)) and extract the corresponding elements of the dataset based on row/column index. It can be assigned to a new column.
df1$matched_value <- df1[-4][cbind(1:nrow(df1),match(df1$new, colnames(df1) ))]
df1
# col1 col2 col3 new matched_value
#1 1 4 55 col1 1
#2 2 3 44 col2 3
#3 3 34 35 col2 34
#4 4 44 87 col3 87
NOTE: If the OP has a data.table, one option is to convert it to a data.frame, or to use with=FALSE while subsetting.
setDF(df1) #to convert to 'data.frame'.
Benchmarks
set.seed(45)
df2 <- data.frame(col1= sample(1:9, 20e6, replace=TRUE),
col2= sample(1:20, 20e6, replace=TRUE),
col3= sample(1:40, 20e6, replace=TRUE),
col4=sample(1:30, 20e6, replace=TRUE),
new= sample(paste0('col', 1:4), 20e6, replace=TRUE), stringsAsFactors=FALSE)
system.time(df2$matched_value <- df2[-5][cbind(1:nrow(df2),match(df2$new, colnames(df2) ))])
# user system elapsed
# 2.54 0.37 2.92
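The row/column matrix indexing used in this answer can be spelled out step by step. This is a base R sketch on the question's four rows; the intermediate names value_cols, m and idx are hypothetical, and it assumes every value column is numeric so that as.matrix gives a numeric matrix.

```r
df1 <- data.frame(col1 = c(1, 2, 3, 4),
                  col2 = c(4, 3, 34, 44),
                  col3 = c(55, 44, 35, 87),
                  new  = c("col1", "col2", "col2", "col3"),
                  stringsAsFactors = FALSE)

# Columns that hold values, i.e. everything except the lookup column
value_cols <- setdiff(names(df1), "new")
m <- as.matrix(df1[value_cols])

# Two-column index matrix: (row number, position of the named column)
idx <- cbind(seq_len(nrow(df1)), match(df1$new, value_cols))

# Indexing a matrix by a two-column matrix extracts one element per row of idx
df1$matched_value <- m[idx]
df1
```

Matching against value_cols rather than all column names keeps the column positions correct even if the lookup column is not the last one.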
