For example, I have a data frame with 4 columns:
col1 col2 col3 col4
I would like to get a new data frame by accumulating each column:
col1 col1+col2 col1+col2+col3 col1+col2+col3+col4
How should I write this in R?
In base R, you can calculate row-wise cumsum using apply.
Using @Henry's data:
startdf[] <- t(apply(startdf, 1, cumsum))
startdf
# col1 col2 col3 col4
#1 1 21 321 4321
#2 4 34 234 1234
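To see why the t() is needed, here is a small sketch on the same data (the names raw and result are only for illustration). When apply runs over rows, it stores each row's result as a column, so the matrix comes back transposed.

```r
startdf <- data.frame(col1 = c(1, 4), col2 = c(20, 30),
                      col3 = c(300, 200), col4 = c(4000, 1000))

# apply(..., 1, f) stores each row's result as a COLUMN,
# so a 2 x 4 input comes back as a 4 x 2 matrix
raw <- apply(startdf, 1, cumsum)
dim(raw)

# Transpose back, then assign into a copy; the empty [] keeps the
# data.frame class and the original column names
result <- startdf
result[] <- t(raw)
result
```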
If this were a matrix, you could use rowCumsums from the matrixStats package.
So, starting with a data frame and returning a data frame, I suppose you could try something like:
library(matrixStats)
startdf <- data.frame(col1=c(1,4), col2=c(20,30),
col3=c(300,200), col4=c(4000,1000))
finishdf <- as.data.frame(rowCumsums(as.matrix(startdf)))
to go from
col1 col2 col3 col4
1 1 20 300 4000
2 4 30 200 1000
to
V1 V2 V3 V4
1 1 21 321 4321
2 4 34 234 1234
Base R (not as efficient or clean as Ronak's) [using Henry's data]:
data.frame(Reduce(rbind, Map(cumsum, data.frame(t(startdf)))), row.names = NULL)
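Another base R possibility, as a sketch rather than a recommendation: since a data frame is a list of columns, Reduce with accumulate = TRUE can add the columns up from left to right without any transposing. The names acc and finishdf below are only illustrative.

```r
startdf <- data.frame(col1 = c(1, 4), col2 = c(20, 30),
                      col3 = c(300, 200), col4 = c(4000, 1000))

# Running column-wise sum: col1, col1+col2, col1+col2+col3, ...
# simplify = FALSE keeps the result a list even for a one-row data frame
acc <- Reduce(`+`, startdf, accumulate = TRUE, simplify = FALSE)

# Restore the column names and turn the list back into a data frame
names(acc) <- names(startdf)
finishdf <- as.data.frame(acc)
finishdf
```

Each `+` here works on whole columns at once, so the number of R-level iterations is the number of columns, not the number of rows.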
I want to select some columns in a data.frame/data.table. However there seems to be a strange behaviour:
Create dummy data:
df=data.frame(col1=c(1,2),col2=c(11,22),col3=c(111,222))
So our data.frame looks like
col1 col2 col3
1 1 11 111
2 2 22 222
Now I define some variables for the column names:
col1='col1'
col2='col2'
So both df[,c(col1,col2)] and df[,c('col1','col2')] result in
col1 col2
1 1 11
2 2 22
as one would expect.
However if I do the same on the data.table (created by df=data.table(df))
col1 col2 col3
1: 1 11 111
2: 2 22 222
something strange happens. df[,c('col1','col2')] still gets the correct result:
col1 col2
1: 1 11
2: 2 22
but df[,c(col1,col2)] does not work anymore:
[1] 1 2 11 22
Why is that?
It is not strange behavior; it is covered in the documentation. Use with = FALSE:
df[, c(col1, col2), with = FALSE]
-output
col1 col2
1: 1 11
2: 2 22
According to ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables. This allows to not just select columns in j, but also compute on them e.g., x[, a] and x[, sum(a)] returns x$a and sum(x$a) as a vector respectively. x[, .(a, b)] and x[, .(sa=sum(a), sb=sum(b))] returns a two column data.table each, the first simply selecting columns a, b and the second computing their sums.
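Other options are (note that .(col1, col2) refers directly to the columns named col1 and col2, not to the variables; it works here only because the names coincide):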
df[, .(col1, col2)]
col1 col2
1: 1 11
2: 2 22
df[, .SD, .SDcols = c(col1, col2)]
col1 col2
1: 1 11
2: 2 22
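A sketch that separates the variable name from the column names may make the difference clearer. The variable cols and the object names below are hypothetical, and the .. prefix assumes a reasonably recent data.table version.

```r
library(data.table)

dt <- data.table(col1 = c(1, 2), col2 = c(11, 22), col3 = c(111, 222))

# A character vector holding the column names to select (hypothetical name)
cols <- c("col1", "col2")

# with = FALSE: treat j as a vector of column names, as data.frame does
sel_with <- dt[, cols, with = FALSE]

# The .. prefix tells data.table to look up `cols` in the calling scope
sel_dots <- dt[, ..cols]

# .SDcols restricts .SD to the named columns
sel_sd <- dt[, .SD, .SDcols = cols]
```

All three should return the same two-column data.table.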
I tried every solution but my problem is still there. I have a big df (20rows*400cols) - for each row I want to count how many columns have a value of more than 16.
The first col is factor and the rest of the columns are integers.
my df:
col1 col2 col3 col4
abc 2 16 17
def 4 2 4
geh 50 60 73
desired output should be:
col1 col2 col3 col4 count
abc 2 16 17 1
def 4 2 4 0
geh 50 60 73 3
I tried df$morethan16 <- rowSums(df[,-1] > 16) but then I get NA in the count column.
We may need na.rm = TRUE to handle NA elements, since the comparison operators (>, <, ==) return NA wherever an element is NA
df$morethan16 <- rowSums(df[,-1] > 16, na.rm = TRUE)
If we still get NA, check the class of the columns. The above code works only if the columns are numeric. Convert to numeric class automatically with type.convert (based on the values of the column)
df <- type.convert(df, as.is = TRUE)
check the structure
str(df)
If the columns are still not numeric, some values may be non-numeric strings that prevent the conversion. Force the columns to numeric with as.numeric (such values become NA). For factor columns, apply as.character first, because as.numeric on a factor returns the internal level codes
df[-1] <- lapply(df[-1], function(x) as.numeric(as.character(x)))
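Putting those steps together on made-up data (a sketch: the NA in col3 and the factor in col4 are assumptions added to reproduce the two likely causes of NA counts, they are not from the question):

```r
df <- data.frame(col1 = factor(c("abc", "def", "geh")),
                 col2 = c(2, 4, 50),
                 col3 = c(16, NA, 60),                  # a missing value
                 col4 = factor(c("17", "4", "73")))     # numbers stored as a factor

# as.numeric() directly on a factor gives level codes, not the values
wrong <- as.numeric(df$col4)

# Go through as.character() first so the printed values are recovered
df[-1] <- lapply(df[-1], function(x) as.numeric(as.character(x)))

# NA > 16 is NA, so na.rm = TRUE is needed to skip it when summing
df$count <- rowSums(df[-1] > 16, na.rm = TRUE)
df
```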
Here is another option using crossprod
df$count <- c(crossprod(rep(1, ncol(df[-1])), t(df[-1] > 16)))
which gives
col1 col2 col3 col4 count
1 abc 2 16 17 1
2 def 4 2 4 0
3 geh 50 60 73 3
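The crossprod version is a matrix product with a vector of ones, which sums each row of the logical matrix. A small sketch on the same numbers (the object names are illustrative) shows that it agrees with rowSums:

```r
vals <- data.frame(col2 = c(2, 4, 50), col3 = c(16, 2, 60), col4 = c(17, 4, 73))

over <- vals > 16                  # 3 x 3 logical matrix
ones <- rep(1, ncol(over))         # one weight per column

# crossprod(x, y) computes t(x) %*% y, so this is ones' %*% t(over)
via_crossprod <- c(crossprod(ones, t(over)))

# The same product written directly, and the usual rowSums
via_matmul  <- c(over %*% ones)
via_rowsums <- unname(rowSums(over))
```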
I have a data.table like this:
col1 col2 col3 new
1 4 55 col1
2 3 44 col2
3 34 35 col2
4 44 87 col3
I want to populate another column matched_value that contains the values from the respective column names given in the new column:
col1 col2 col3 new matched_value
1 4 55 col1 1
2 3 44 col2 3
3 34 35 col2 34
4 44 87 col3 87
E.g., in the first row, the value of new is "col1" so matched_value takes the value from col1, which is 1.
How can I do this efficiently in R on a very large data.table?
An excuse to use the obscure .BY:
DT[, newval := .SD[[.BY[[1]]]], by=new]
col1 col2 col3 new newval
1: 1 4 55 col1 1
2: 2 3 44 col2 3
3: 3 34 35 col2 34
4: 4 44 87 col3 87
How it works. This splits the data into groups based on the strings in new. The value of the string for each group is stored in newname = .BY[[1]]. We use this string to select the corresponding column of .SD via .SD[[newname]]. .SD stands for Subset of Data.
Alternatives. get(.BY[[1]]) should work just as well in place of .SD[[.BY[[1]]]]. According to a benchmark run by @David, the two ways are equally fast.
We can match the 'new' column with the column names of the dataset to get the column index, cbind with the row index (1:nrow(df1)) and extract the corresponding elements of the dataset based on row/column index. It can be assigned to a new column.
df1$matched_value <- df1[-4][cbind(1:nrow(df1),match(df1$new, colnames(df1) ))]
df1
# col1 col2 col3 new matched_value
#1 1 4 55 col1 1
#2 2 3 44 col2 3
#3 3 34 35 col2 34
#4 4 44 87 col3 87
NOTE: If the OP has a data.table, one option is to convert it to a data.frame, or to use with=FALSE while subsetting.
setDF(df1) #to convert to 'data.frame'.
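For readers unfamiliar with indexing by a two-column matrix, here is a sketch of the same idea spelled out step by step on the question's data. The helper names value_cols and idx are hypothetical, and the value columns are converted to a plain matrix first to keep the indexing explicit.

```r
df1 <- data.frame(col1 = 1:4, col2 = c(4, 3, 34, 44), col3 = c(55, 44, 35, 87),
                  new = c("col1", "col2", "col2", "col3"),
                  stringsAsFactors = FALSE)

# Columns that hold values (everything except the lookup column)
value_cols <- setdiff(names(df1), "new")

# One (row, column) pair per row: row i, and the column named in new[i]
idx <- cbind(seq_len(nrow(df1)), match(df1$new, value_cols))

# Indexing a matrix by a two-column matrix picks one element per pair
df1$matched_value <- as.matrix(df1[value_cols])[idx]
df1
```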
Benchmarks
set.seed(45)
df2 <- data.frame(col1= sample(1:9, 20e6, replace=TRUE),
col2= sample(1:20, 20e6, replace=TRUE),
col3= sample(1:40, 20e6, replace=TRUE),
col4=sample(1:30, 20e6, replace=TRUE),
new= sample(paste0('col', 1:4), 20e6, replace=TRUE), stringsAsFactors=FALSE)
system.time(df2$matched_value <- df2[-5][cbind(1:nrow(df2),match(df2$new, colnames(df2) ))])
# user system elapsed
# 2.54 0.37 2.92
I think this is pretty simple. I have a dataframe called df. It has 51 columns. The rows in each column contain random integers. All I want to do as a loop is add all the integers in all the rows of each column and then store the output for each of the columns in a separate list or dataframe.
The df looks like this
Col1 col2 col3 col4
34 12 33 67
22 1 56 66
Etc
The output I want is:
Col1 col2 col3 col4
56 13 89 133
I do want to do this as a loop, as I want to apply what I've learnt here to a more complex script with similar output and I need to do it quickly. I can't quite master functions yet...
You can use the built in function colSums for this:
> df <- data.frame(col1 = c(1,2,3), col2 = c(2,3,4))
> colSums(df)
col1 col2
6 9
Another option using a loop:
# Create the result data frame (one row, same columns as df)
> res <- data.frame(df[1,], row.names = c('count'))
# Populate the results, one column sum per iteration
> for(n in seq_len(ncol(df))) { res[colnames(df)[n]] <- sum(df[n]) }
> res
      col1 col2
count    6    9
If you really want a loop rather than a vectorized solution, use apply to loop over columns (second argument 2; 1 loops over rows), passing the function you want (here sum):
df = data.frame(col1=1:3,col2=2:4,col3=3:5)
apply(df, 2, sum)
#col1 col2 col3
# 6 9 12
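If an explicit for loop is what you are after, a minimal sketch on the question's numbers could look like this (the name totals is just an example). Preallocating the result and filling it by position is a pattern that carries over to more complex loops.

```r
df <- data.frame(col1 = c(34, 22), col2 = c(12, 1),
                 col3 = c(33, 56), col4 = c(67, 66))

# Preallocate one slot per column and label the slots
totals <- numeric(ncol(df))
names(totals) <- names(df)

# df[[j]] extracts column j as a plain vector
for (j in seq_along(df)) {
  totals[j] <- sum(df[[j]])
}
totals
```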
I have a data frame with a number of columns, and would like to output a separate column for each with the length of each row in it.
I am trying to iterate through the column names, and for each column output a corresponding column with '_length' attached.
For example col1 | col2 would go to col1 | col2 | col1_length | col2_length
The code I am using is:
df <- data.frame(col1 = c("abc","abcd","a","abcdefg"),col2 = c("adf qqwe","d","e","f"))
for(i in names(df)){
df$paste(i,'length',sep="_") <- str_length(df$i)
}
However this throws an error:
invalid function in complex assignment.
Am I able to use loops in this way in R?
You need to use [[, the programmatic equivalent of $. Otherwise, for example, when i is col1, R will look for df$i instead of df$col1.
library(stringr)  # provides str_length
for(i in names(df)){
df[[paste(i, 'length', sep="_")]] <- str_length(df[[i]])
}
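A quick sketch of the difference between $ and [[ in this situation. To keep it free of package dependencies it uses base R's nchar in place of str_length, which is a substitution on my part.

```r
df <- data.frame(col1 = c("abc", "abcd", "a", "abcdefg"),
                 col2 = c("adf qqwe", "d", "e", "f"),
                 stringsAsFactors = FALSE)

i <- "col1"
dollar_result  <- df$i     # looks for a column literally named "i": NULL
bracket_result <- df[[i]]  # evaluates i first, so this is df$col1

# [[ also works on the left-hand side with a computed name
for (i in names(df)) {
  df[[paste(i, "length", sep = "_")]] <- nchar(df[[i]])
}
names(df)
```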
You can use lapply to pass each column to str_length, then cbind it to your original data.frame...
library(stringr)
out <- lapply( df , str_length )
df <- cbind( df , out )
# col1 col2 col1 col2
#1 abc adf qqwe 3 8
#2 abcd d 4 1
#3 a e 1 1
#4 abcdefg f 7 1
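Note the duplicated column names in that output. One way around it, sketched here with base R's nchar standing in for str_length (my substitution), is to rename the list before binding:

```r
df <- data.frame(col1 = c("abc", "abcd", "a", "abcdefg"),
                 col2 = c("adf qqwe", "d", "e", "f"),
                 stringsAsFactors = FALSE)

# One vector of lengths per column, still named col1, col2
out <- lapply(df, nchar)

# Give the new columns distinct names before binding
names(out) <- paste0(names(out), "_length")
df <- cbind(df, as.data.frame(out))
names(df)
```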
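With dplyr and stringr you can use mutate_all (funs() has been deprecated since dplyr 0.8.0; in current versions, pass list(length = ~ str_length(.)) instead, or use across()):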
> df %>% mutate_all(funs(length = str_length(.)))
col1 col2 col1_length col2_length
1 abc adf qqwe 3 8
2 abcd d 4 1
3 a e 1 1
4 abcdefg f 7 1
For the sake of completeness, there is also a data.table solution:
library(data.table)
result <- setDT(df)[, paste0(names(df), "_length") := lapply(.SD, stringr::str_length)]
result
# col1 col2 col1_length col2_length
#1: abc adf qqwe 3 8
#2: abcd d 4 1
#3: a e 1 1
#4: abcdefg f 7 1