I have been struggling for some time with a task in R that seems like it should be easy.
Suppose this is my sample data:
df <- data.frame(a=c(2,2,7),b=c(1,4,3),c=c(9,5,3))
v <- c(1,2,3)
Now I would like to multiply each column by the corresponding vector element, e.g. the first column by v[1], the second column by v[2], etc.
Expected output:
a b c
1 2 2 27
2 2 8 15
3 7 6 9
The target data is much larger and consists of integers and floating point numbers.
Thank you in advance!
You can use sweep:
sweep(df, 2, v, FUN="*")
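Here the 2 is the MARGIN (sweep over columns), v supplies one multiplier per column, and FUN = "*" does the multiplication. On the sample data this keeps the data.frame class and should print the expected result:
#   a b  c
# 1 2 2 27
# 2 2 8 15
# 3 7 6  9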
A second option is mapply:
mapply(`*`, df, v)
Or by transposing:
t(t(df)*v)
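Note that mapply and the t(t(df) * v) approach both return a matrix rather than a data frame; if you need a data frame back, wrap the result, for example:
as.data.frame(mapply(`*`, df, v))
# or
as.data.frame(t(t(df) * v))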
You can try col:
> v[col(df)] * df
a b c
1 2 2 27
2 2 8 15
3 7 6 9
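This works because col(df) returns a matrix of column indices with the same shape as df, so v[col(df)] expands v into a matrix of the matching multipliers, and the element-wise product with df stays a data frame:
col(df)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    1    2    3
# [3,]    1    2    3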
apply(df, 1, function(x) x * v) |> t()
or
t(apply(df, 1, function(x) x * v))
I have a data frame in R and I want to create a single numeric vector by splitting all of the character values in a specific column and then appending them to the vector. The values in the column are all comma-separated numbers, and some rows have missing values (NA).
Current data
id col
1 2,6,10
2 NA
3 5, 10
4 1
Final vector
# v <- c(2, 6, 10, 5, 10, 1)
# v
[1] 2 6 10 5 10 1
I'm able to do this by iterating through all the values in the column, but I know this isn't the most efficient way, since R is designed to work with vectors. Is there a better way to do this?
v <- c()
for (val in df$col) {
  if (!is.na(val)) {
    ints <- as.numeric(unlist(strsplit(val, ",")))
    v <- c(v, ints)
  }
}
You already have the answer in your code, since all the functions you are using are vectorised.
v <- as.numeric(na.omit(unlist(strsplit(df$col, ','))))
v
#[1] 2 6 10 5 10 1
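To see why this works, look at the intermediate result: strsplit splits each entry, the NA simply stays NA, and the stray space in "5, 10" is harmless because as.numeric ignores surrounding whitespace:
unlist(strsplit(df$col, ','))
# a character vector: "2" "6" "10" NA "5" " 10" "1"
na.omit then drops the NA before the conversion to numeric.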
Does this work?
library(dplyr)
library(tidyr)
df %>% separate_rows(col) %>% na.omit() %>% pull(col) %>% as.numeric() -> v
v
[1] 2 6 10 5 10 1
Data used:
df
# A tibble: 4 x 2
id col
<dbl> <chr>
1 1 2,6,10
2 2 NA
3 3 5, 10
4 4 1
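By default separate_rows splits on runs of non-alphanumeric characters (excluding the decimal point), so the space after the comma in "5, 10" is handled automatically; after the split there is one value of col per row, including one NA row, which na.omit then drops before pull and as.numeric:
df %>% separate_rows(col)
# a 7 x 2 tibble with col = "2", "6", "10", NA, "5", "10", "1"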
I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table.
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
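If you need this more than once, the same idea can be wrapped in a small helper (the function name and the n argument are mine, not from the question):
sample_cols_per_group <- function(d, n) {
  idx  <- split(seq_along(d), names(d))                  # column indices, grouped by column name
  keep <- unlist(Map(sample, idx, pmin(n, lengths(idx))))
  d[, keep]
}
set.seed(1)                                              # for reproducibility
res7 <- sample_cols_per_group(dframe, 7)                 # at most 7 columns per group
One caveat inherited from sample(): if a group ever consists of a single column, sample() on a length-one numeric vector samples from 1:x instead, so guard against that for completely general data.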
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
               lapply(unique(colnames(dframe)), function(x) {
                 cols <- which(colnames(dframe) == x)
                 dframe[, if (length(cols) <= nc) cols else sample(cols, nc, replace = FALSE)]
               }))
It might look complicated, but it really just takes all of a group's columns if there are at most nc of them, and samples nc random columns if there are more.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('\\.[[:digit:]]+$', '', colnames(res))
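A quick sanity check that the group sizes come out as intended (with nc = 8 you should get 8 columns from A, 8 from B and all 6 from C):
table(colnames(res))
# counts of the column names after the gsub step, e.g. A = 8, B = 8, C = 6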
I have a data frame x:
begin end
1 1 3
2 5 6
3 11 18
and a vector v <- c(1,2,5,9,10,11,17,20)
I'd like to find all values from the vector that fall inside any of the intervals in the data frame, so I would like to get the vector c(1,2,5,11,17). How can I do this?
To get row-wise values, use apply on MARGIN 1 with intersect:
apply(x, 1, function(a) intersect(v, a[1]:a[2]))
#[[1]]
#[1] 1 2
#[[2]]
#[1] 5
#[[3]]
#[1] 11 17
Or unlist to get a single vector:
unlist(apply(x, 1, function(a) intersect(v, a[1]:a[2])))
#OR
intersect(v, unlist(apply(x, 1, function(a) a[1]:a[2]))) # as commented by akrun
#[1] 1 2 5 11 17
We can use Map to get the sequences between the corresponding begin/end values as a list, unlist the list, and use intersect to get the elements common to both vectors:
intersect(unlist(Map(`:`, x$begin, x$end)), v)
#[1] 1 2 5 11 17
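If the intervals are wide, expanding every begin:end sequence can get expensive; a comparison-based sketch that avoids building the sequences altogether (it keeps the order of v, like the intersect solutions):
v[sapply(v, function(val) any(val >= x$begin & val <= x$end))]
# 1 2 5 11 17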
I would like to process all rows in data frame df by applying function f to every row. Since f returns a numeric vector with two elements, I would like to assign the individual elements to new columns in df.
Sample df, a trivial function f returning two elements, and my attempt using apply:
df <- data.frame(a = 1:3, b = 3:5)
f <- function (a, b) {
c(a + b, a * b)
}
df[, c('apb', 'amb')] <- apply(df, 1, function(x) f(a = x[1], b = x[2]))
This does not work; the results are assigned by column:
> df
a b apb amb
1 1 3 4 8
2 2 4 3 8
3 3 5 6 15
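The reason is that apply collects each row's result as a column, so the intermediate value is a 2-by-3 matrix (sums in the first row, products in the second), and assigning it into the two new columns fills them column by column from that matrix:
apply(df, 1, function(x) f(a = x[1], b = x[2]))
# a 2 x 3 matrix: first row 4 6 8 (a + b), second row 3 8 15 (a * b)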
You could also use Reduce instead of apply; since f is vectorised, calling it once on whole columns is more efficient than applying it row by row. You just need to slightly modify your function to use cbind instead of c:
f <- function (a, b) {
cbind(a + b, a * b) # modified to use `cbind` instead of `c`
}
df[c('apb', 'amb')] <- Reduce(f, df)
df
# a b apb amb
# 1 1 3 4 3
# 2 2 4 6 8
# 3 3 5 8 15
Note: this will only work nicely if you have exactly two columns (as in your example); if you have more columns in your data set, run this only on the relevant two-column subset.
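The reason the two-column case works so neatly: Reduce with no initial value simply calls the modified f once on the two whole columns, so everything stays vectorised. A quick check, assuming df is still the original two-column data frame:
identical(Reduce(f, df), f(df$a, df$b))
# [1] TRUE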
You need to transpose the apply result to get what you want:
df[, c('apb', 'amb')] <- t(apply(df, 1, function(x) f(a = x[1], b = x[2])))
> df
a b apb amb
1 1 3 4 3
2 2 4 6 8
3 3 5 8 15
Assume you have a data frame like this:
df <- data.frame(Nums = c(1,2,3,4,5,6,7,8,9,10), Cum.sums = NA)
> df
Nums Cum.sums
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
and you want an output like this:
Nums Cum.sums
1 1 0
2 2 0
3 3 0
4 4 3
5 5 5
6 6 7
7 7 9
8 8 11
9 9 13
10 10 15
The 4th element of the column Cum.sums is the sum of rows 1 and 2 of Nums, the 5th element is the sum of rows 2 and 3, and so on...
In other words, I don't want the normal cumulative sum, but for each row the sum of the element two rows above the current row and the element three rows above it.
I already tried to play around a little with the sum and cumsum functions, but I failed.
Any ideas?
Thanks!
You could use the embed function to create the appropriate lags, rowSums to sum, then lag appropriately (I used head).
df$Cum.sums[-(1:3)] <- head(rowSums(embed(df$Nums,2)),-2)
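To see what each step produces on the sample data: embed(df$Nums, 2) pairs every element with the one before it, rowSums adds each pair, and head(..., -2) drops the last two sums so the remaining seven values line up with rows 4 to 10:
rowSums(embed(df$Nums, 2))
# 3 5 7 9 11 13 15 17 19   (Nums[1]+Nums[2], Nums[2]+Nums[3], ...)
head(rowSums(embed(df$Nums, 2)), -2)
# 3 5 7 9 11 13 15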
You don't need any special function, just use normal vector operations (these solutions are all equivalent):
df$Cum.sums[-(1:3)] <- head(df$Nums, -3) + head(df$Nums[-1], -2)
or
with(df, Cum.sums[-(1:3)] <- head(Nums, -3) + head(Nums[-1], -2))
or
df$Cum.sums[-(1:3)] <- df$Nums[1:(nrow(df)-3)] + df$Nums[2:(nrow(df)-2)]
I believe the first 3 sums SHOULD be NA, not 0, but if you prefer zeroes, you can initialize the sums first:
df$Cum.sums <- 0
Another solution, elegant and general, using matrix multiplication; it is very inefficient for large data, so not very practical, but a nice exercise:
len <- nrow(df)
sr  <- 2   # number of rows to sum
lag <- 3   # the first `lag` rows of the result stay 0
mat <- matrix(
  head(c(
    rep(0, lag * len),                        # `lag` leading rows of zeros
    rep(rep(1:0, c(sr, len - sr + 1)), len)   # a 1,1,0,...,0 pattern of length len + 1,
  ), len * len),                              # so the pair of 1s shifts right by one position per row
  nrow = len, byrow = TRUE
)
mat %*% df$Nums
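To put the result back into the data frame (a small addition, not part of the original answer):
df$Cum.sums <- as.vector(mat %*% df$Nums)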