How do I get this result in R for a large set of data?
data <- data.frame(col1 = c("A", "A", "B", "B", "B", "", "", "", "", "", ""),
                   col2 = c(1, 2, 3, 4, 2, 5, 4, 7, 1, 2, 3),
                   col3 = c(5, 6, 7, 10, 15, 15, 10, 20, 30, 40, 50))
where
col3=sale number
Desired output:
From col2, select the rows that are not assigned a value in col1. For example, col2 rows 1 and 2 are assigned to A in col1, so I want the sale numbers from col3 excluding the ones already present under A, i.e. 5 and 6.
Similarly, col2 values 3, 4 and 2 are assigned to B in col1, so I want the sale numbers from col3 excluding the ones present under B, i.e. 7, 10 and 15.
Expected result:
col1 col2 col3 (sum of sale)
A    1    30
A    2    40
B    3    50
B    4    10
B    2     6
B    2    40
I want to select some columns in a data.frame/data.table. However, there seems to be a strange behaviour:
Create dummy data:
df=data.frame(col1=c(1,2),col2=c(11,22),col3=c(111,222))
So our data.frame looks like
col1 col2 col3
1 1 11 111
2 2 22 222
Now I define some variables for the column names:
col1='col1'
col2='col2'
So both df[,c(col1,col2)] and df[,c('col1','col2')] result in
col1 col2
1 1 11
2 2 22
as one would expect.
However, if I do the same on the data.table (created by df = data.table(df))
col1 col2 col3
1: 1 11 111
2: 2 22 222
something strange happens. df[,c('col1','col2')] still gets the correct result:
col1 col2
1: 1 11
2: 2 22
but df[,c(col1,col2)] does not work anymore:
[1] 1 2 11 22
Why is that?
It is not strange behaviour; it is already mentioned in the documentation. Use with = FALSE:
df[, c(col1, col2), with = FALSE]
Output:
col1 col2
1: 1 11
2: 2 22
According to ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables. This allows to not just select columns in j, but also compute on them e.g., x[, a] and x[, sum(a)] returns x$a and sum(x$a) as a vector respectively. x[, .(a, b)] and x[, .(sa=sum(a), sb=sum(b))] returns a two column data.table each, the first simply selecting columns a, b and the second computing their sums.
Other options are
df[, .(col1, col2)]
col1 col2
1: 1 11
2: 2 22
df[, .SD, .SDcols = c(col1, col2)]
col1 col2
1: 1 11
2: 2 22
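To see why df[, c(col1, col2)] returned a plain vector in the first place: with the default with = TRUE, j is evaluated inside the data.table, so col1 and col2 resolve to the columns themselves and c() simply concatenates them. A minimal sketch reproducing the setup above (assuming the data.table package is loaded):
library(data.table)
dt <- data.table(col1 = c(1, 2), col2 = c(11, 22), col3 = c(111, 222))
col1 <- 'col1'
col2 <- 'col2'
# j sees the columns, not the character variables, so the two numeric
# columns are concatenated into a single vector:
dt[, c(col1, col2)]
#[1]  1  2 11 22
# with = FALSE makes j treat c(col1, col2) as a character vector of names:
dt[, c(col1, col2), with = FALSE]
#   col1 col2
#1:    1   11
#2:    2   22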
For example, I have a data frame with 4 columns:
col1 col2 col3 col4
I would like to get a new data frame by accumulating each column:
col1 col1+col2 col1+col2+col3 col1+col2+col3+col4
How should I write this in R?
In base R, you can calculate row-wise cumsum using apply.
Using @Henry's data:
startdf[] <- t(apply(startdf, 1, cumsum))
startdf
# col1 col2 col3 col4
#1 1 21 321 4321
#2 4 34 234 1234
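The t() is needed because apply() over rows returns each row's cumulative sums as a column of its result, so the matrix has to be transposed back before being assigned into startdf. A quick check, assuming the same startdf defined in the next answer (@Henry's data):
apply(startdf, 1, cumsum)
#        1    2
#col1    1    4
#col2   21   34
#col3  321  234
#col4 4321 1234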
If this were a matrix then you could use rowCumsums from the matrixStats package.
So, starting with a data frame and returning to a data frame, I suppose you could try something like
library(matrixStats)
startdf <- data.frame(col1=c(1,4), col2=c(20,30),
col3=c(300,200), col4=c(4000,1000))
finishdf <- as.data.frame(rowCumsums(as.matrix(startdf)))
to go from
col1 col2 col3 col4
1 1 20 300 4000
2 4 30 200 1000
to
V1 V2 V3 V4
1 1 21 321 4321
2 4 34 234 1234
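Note that the V1..V4 names come from rowCumsums() dropping the original column names (older matrixStats versions do this); if you want to keep them, you can copy them back afterwards. A small sketch, assuming the same startdf:
library(matrixStats)
finishdf <- as.data.frame(rowCumsums(as.matrix(startdf)))
names(finishdf) <- names(startdf)  # restore the original column names
finishdf
#  col1 col2 col3 col4
#1    1   21  321 4321
#2    4   34  234 1234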
Base R (not as efficient or clean as @Ronak's) [using @Henry's data]:
data.frame(Reduce(rbind, Map(cumsum, data.frame(t(startdf)))), row.names = NULL)
I have a data frame and I want to add a new column where every entry is 1. How can I do that?
For example:
col1. col2
1. 2.
4. 5.
33. 4.
5. 3.
With the new column added:
col1. col2. col3
1. 2. 1
4. 5. 1
33. 4. 1
5. 3. 1
df1$col3 <- 1
This should work as well.
Likewise,
df1 <- data.frame(df1, col3 = 1)
could also work.
The simplest option is to use ?Extract:
df1['col3'] <- 1
One of the good things about using [ instead of $ is that we can pass variable identifiers as well
v1 <- 'col3'
df1[v1] <- 1
But, if we do
df1$v1 <- 1
it creates a column named 'v1' instead of 'col3'.
Other variations without changing the initial object would be
transform(df1, col3 = 1)
cbind(df1, col3 = 1)
NOTE: All of these create a column appended as the last column.
Also, there is a convenient function add_column in the tibble package which can add a column at a specified position. By default, it creates the column as the last one:
library(tibble)
add_column(df1, col3 = 1)
# col1. col2 col3
#1 1 2 1
#2 4 5 1
#3 33 4 1
#4 5 3 1
But if we need to place it at a specific location, there are arguments for that:
add_column(df1, col3 = 1, .after = 1)
# col1. col3 col2
#1 1 1 2
#2 4 1 5
#3 33 1 4
#4 5 1 3
data
df1 <- structure(list(col1. = c(1, 4, 33, 5), col2 = c(2, 5, 4, 3)),
class = "data.frame", row.names = c(NA,
-4L))
Is it possible to swap two adjacent rows with each other, then move on to the next two rows and swap those with each other as well? I.e. swap the col1 value in row 1 with the col1 value in row 2; swap the col1700 value in row 87 with the col1700 value in row 88.
sample data:
col1 col2
row1 a b
row2 b b
row3 c a
row4 d c
My real data has many rows and columns, and the data changes each time I go through a loop, so I need a way that doesn't refer to specific column names or row names.
The desired result would look like:
col1 col2
row1 b b
row2 a b
row3 d c
row4 c a
Add 1 to the index of the first row in each group of 2, and subtract 1 from the index of the second row:
dat[seq_len(nrow(dat)) + c(1,-1),]
# col1 col2
#row2 b b
#row1 a b
#row4 d c
#row3 c a
This works because of the vector recycling in R:
1:10 + c(1,-1)
#[1] 2 1 4 3 6 5 8 7 10 9
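One caveat not in the original answer: this assumes an even number of rows. With an odd row count the last index becomes nrow(dat) + 1, which selects an NA row (plus a recycling warning). A small sketch of a guard that leaves a trailing odd row in place (the helper name swap_adjacent is mine, not from the answer):
swap_adjacent <- function(dat) {
  n <- nrow(dat)
  idx <- seq_len(n)
  pairs <- seq_len(n - n %% 2)    # indices that form complete pairs
  idx[pairs] <- pairs + c(1, -1)  # swap within each pair
  dat[idx, , drop = FALSE]
}
swap_adjacent(dat)
# col1 col2
#row2 b b
#row1 a b
#row4 d c
#row3 c a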
Another way is to create two sequences, one for the odd and another for the even row numbers, interleave them, and then use them as row indexes.
df[c(rbind(seq(2, nrow(df), 2), seq(1, nrow(df), 2))),]
# col1 col2
#row2 b b
#row1 a b
#row4 d c
#row3 c a
where
seq(2, nrow(df), 2)
#[1] 2 4
generates the even-numbered sequence, and
seq(1, nrow(df), 2)
#[1] 1 3
generates the odd-numbered sequence.
We then use rbind and c to alternately select indexes from the two vectors.
This question already has answers here:
Data Table - Select Value of Column by Name From Another Column
I have a data.table like this:
col1 col2 col3 new
1 4 55 col1
2 3 44 col2
3 34 35 col2
4 44 87 col3
I want to populate another column matched_value that contains the values from the respective column names given in the new column:
col1 col2 col3 new matched_value
1 4 55 col1 1
2 3 44 col2 3
3 34 35 col2 34
4 44 87 col3 87
E.g., in the first row, the value of new is "col1" so matched_value takes the value from col1, which is 1.
How can I do this efficiently in R on a very large data.table?
An excuse to use the obscure .BY:
DT[, newval := .SD[[.BY[[1]]]], by=new]
col1 col2 col3 new newval
1: 1 4 55 col1 1
2: 2 3 44 col2 3
3: 3 34 35 col2 34
4: 4 44 87 col3 87
How it works: this splits the data into groups based on the strings in new. The string value for each group is available as .BY[[1]]; call it newname. We use this string to select the corresponding column of .SD via .SD[[newname]]. .SD stands for Subset of Data.
Alternatives. get(.BY[[1]]) should work just as well in place of .SD[[.BY[[1]]]]. According to a benchmark run by @David, the two ways are equally fast.
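For completeness, the get() variant mentioned above would look like this (a sketch, assuming the same DT):
DT[, newval := get(.BY[[1]]), by = new]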
We can match the 'new' column against the column names of the dataset to get the column index, cbind it with the row index (1:nrow(df1)), and extract the corresponding elements of the dataset by row/column index. The result can be assigned to a new column.
df1$matched_value <- df1[-4][cbind(1:nrow(df1), match(df1$new, colnames(df1)))]
df1
# col1 col2 col3 new matched_value
#1 1 4 55 col1 1
#2 2 3 44 col2 3
#3 3 34 35 col2 34
#4 4 44 87 col3 87
NOTE: If the OP has a data.table, one option is to convert it to a data.frame or use with = FALSE while subsetting.
setDF(df1) # to convert to 'data.frame'
Benchmarks
set.seed(45)
df2 <- data.frame(col1= sample(1:9, 20e6, replace=TRUE),
col2= sample(1:20, 20e6, replace=TRUE),
col3= sample(1:40, 20e6, replace=TRUE),
col4=sample(1:30, 20e6, replace=TRUE),
new= sample(paste0('col', 1:4), 20e6, replace=TRUE), stringsAsFactors=FALSE)
system.time(df2$matched_value <- df2[-5][cbind(1:nrow(df2),match(df2$new, colnames(df2) ))])
# user system elapsed
# 2.54 0.37 2.92