User Defined Function to create and sum a subset in R - r

I need help defining a function that adds a new column to a data frame. For each row, the function should take the value in another column of that data frame, find all rows of a second data frame whose designated column matches that value, sum a separate column over those matching rows, and write the result to the corresponding row of the new column in the original data frame.
In other words, I have a data frame that looks something like this:
ID <- c('a', 'b', 'c', 'd', 'e')
M <- 20:39
df <- data.frame(cbind(ID, M))
df$M <- as.numeric(df$M)
> df
ID M
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 a 6
7 b 7
8 c 8
9 d 9
10 e 10
11 a 11
12 b 12
13 c 13
14 d 14
15 e 15
16 a 16
17 b 17
18 c 18
19 d 19
20 e 20
> str(df)
'data.frame': 20 obs. of 2 variables:
$ ID: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5 1 2 3 4 5 ...
$ M : num 1 2 3 4 5 6 7 8 9 10 ...
I would like to create a new data frame, Z, such that Z <- data.frame(cbind(X, Y)) where:
X <- as.character(unique(df$ID))
> X
[1] "a" "b" "c" "d" "e"
and Y is a vector of the sum of all a's, sum of all b's, sum of all c's, etc...
So, Y should be equal to c(34, 38, 42, 46, 50) and my final result should be:
> Z
X Y
1 a 34
2 b 38
3 c 42
4 d 46
5 e 50
> str(Z)
'data.frame': 5 obs. of 2 variables:
$ X: chr "a" "b" "c" "d" ...
$ Y: num 34 38 42 46 50
To do this, I've tried first turning X into a data frame (is it easier to work with as a data table?):
> Z <- data.frame(X)
> Z
X
1 a
2 b
3 c
4 d
5 e
> str(Z)
'data.frame': 5 obs. of 1 variable:
$ X: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
and then defining Y as Z$Y <- sum(df[df$ID == Z$X, 2]) but I don't get unique values:
> Z
X Y
1 a 210
2 b 210
3 c 210
4 d 210
5 e 210
I've also tried defining the function f1() like so:
f1 <- function(v, w, x, y, z){sum(v[v$w == x$y, z])}
but that gets me:
> f1(df, 'ID', Z, 'X', 'M')
[1] 0
I have found a function from another post on this forum that does something similar:
f1 <- function(df, cols, match_with, to_x = 50){
df[cols] <- lapply(df[cols], function(i)
ifelse(grepl(to_x, match_with, fixed = TRUE), 'MID',
i))
return(df)
}
This looks for the value "50" in the match_with column and returns the value "MID" to that row of the column designated by cols, provided both columns are in the same data frame df. So, I would need to replace to_x = 50 with something that, instead of looking for the fixed value "50," looks for whatever value is in the column Z$X and, instead of returning the fixed value "MID," returns the sum of the values df[df$ID == Z$X, df$M]. I've attempted these changes myself by writing variations of the following:
f1 <- function(df, cols, match_with, to_x = df[ , 1], x){
df[cols] <- lapply(df[cols], function(i)
ifelse(grepl(to_x, match_with, fixed = TRUE), sum(x),
i))
return(df)
}
but, so far, none of my variations have produced the desired results. This one gave me:
> f1(Z, df, cols = c('Y'), match_with = df$ID, x = df$M)
X Y
1 a 210
2 b 210
3 c 210
4 d 210
5 e 210
Warning messages:
1: In grepl(to_x, match_with, fixed = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
2: In `[<-.data.frame`(`*tmp*`, cols, value = list(Y = c(210, 210, :
replacement element 1 has 20 rows to replace 5 rows
It seems to be summing the entirety of df$M instead of the subsets where df$ID == Z$X. In other variations it seemed to have problems referencing a column in a second data frame.
I am somewhat new to R and have almost no experience writing user-defined functions (as you probably could tell by this question). Any help would be very much appreciated!

Never mind, y'all, I think I got it!
> f1 <- function(col1, col2, df2, to_add){
+ lapply(col1, function(i){
+ df2$x <- grepl(i, col2, fixed = TRUE)
+ df3 <- df2[df2$x == TRUE, to_add]
+ sum(df3, na.rm = TRUE)
+ })}
> Z$Y <- f1(Z$X, df$ID, df, c('M'))
> Z
X Y
1 a 34
2 b 38
3 c 42
4 d 46
5 e 50
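For what it's worth, the first attempt in the question returns 210 in every row because df$ID == Z$X recycles the five values of Z$X along the 20 rows of df; since the IDs repeat in the same order, every comparison is TRUE and the whole column is summed. Note also that lapply() returns a list, so Z$Y above ends up as a list column. Below is a minimal sketch of the same grouped sum using base R's aggregate() and tapply(). It assumes the example data is rebuilt as shown, and the object name sums is only illustrative.

```r
# Rebuild the example data: five IDs repeated four times, M = 1..20
df <- data.frame(ID = rep(c("a", "b", "c", "d", "e"), times = 4),
                 M  = as.numeric(1:20),
                 stringsAsFactors = FALSE)

# Grouped sum: one row per ID, holding the sum of M for that ID
Z <- aggregate(M ~ ID, data = df, FUN = sum)
names(Z) <- c("X", "Y")
Z  # Y is 34, 38, 42, 46, 50

# The same sums as a named numeric vector
sums <- tapply(df$M, df$ID, sum)
sums[["a"]]  # 34
```

Either form gives a plain numeric result rather than a list, which is usually easier to work with afterwards.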

Related

Operating between columns and classifying values per group in R

I am trying to obtain percentages, grouped by the values of one variable.
I used sapply to obtain the percentage for each column relative to another one, but I don't know how to group these values by type (another variable).
x <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
"type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))
x
A B C type yes
1 0 0 1 x 0
2 0 1 0 x 0
3 1 0 1 x 1
4 1 1 1 y 1
5 1 0 0 y 0
6 1 1 0 y 1
7 1 1 1 x 1
I need to obtain the following percentage: count(A == 1 & yes == 1) / count(A == 1). For this I use the following code:
result <- as.data.frame(sapply(x[,1:3],
function(i) (sum(i & x$yes)/sum(i))*100))
result
sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A 80
B 75
C 75
Now I need the same calculation, but taking the variable "type" into account, that is, the same percentage broken down by type. My expected table is:
type sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A x 40
A y 40
B x 25
B y 50
C x 50
C y 25
In the example you can see that, for each letter, the percentages add up to the value obtained in the first result; here they are just broken down by type.
Thanks a lot.
You can do the following using data.table:
Code
library(data.table)
setDT(df)
cols = c('A', 'B', 'C')
mat = df[yes == 1, lapply(.SD, function(x){
100 * sum(x)/df[, lapply(.SD, sum), .SDcols = cols][[substitute(x)]]
# Here, the numerator is sum(x | yes == 1) for x == columns A, B, C
# If we look at the denominator, it equals sum(x) for x == columns A, B, C
# The reason why we need to apply substitute(x) is because df[, lapply(.SD, sum)]
# generates a list of column sums, i.e. list(A = sum(A), B = sum(B), ...).
# Hence, for each x in the column names we must subset the list above using [[substitute(x)]]
# Ultimately, the operation equals sum(x | yes == 1)/sum(x) for A, B, C.
}), .(type), .SDcols = cols]
# '.(type)' simply means that we apply this for each type group,
# i.e. once for x and once for y, for each ABC column.
# The dot is just shorthand for 'list()'.
# .SDcols assigns the subset that I want to apply my lapply statement onto.
Result
> mat
type A B C
1: x 40 25 50
2: y 40 50 25
Long format (your example)
> melt(mat)
type variable value
1: x A 40
2: y A 40
3: x B 25
4: y B 50
5: x C 50
6: y C 25
Data
df <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
"type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))
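For comparison, here is a minimal base R sketch of the same calculation, assuming the question's data frame x. The names num, den and pct are only illustrative. It uses rowsum() for the per-type numerator and sweep() to divide by the overall column totals.

```r
x <- data.frame(A = c(0, 0, 1, 1, 1, 1, 1),
                B = c(0, 1, 0, 1, 0, 1, 1),
                C = c(1, 0, 1, 1, 0, 0, 1),
                type = c("x", "x", "x", "y", "y", "y", "x"),
                yes = c(0, 0, 1, 1, 0, 1, 1))
cols <- c("A", "B", "C")

# Numerator: per type, count rows where the column is 1 and yes is 1
num <- rowsum(as.matrix(x[cols]) * x$yes, group = x$type)

# Denominator: overall count of 1s in each column (not split by type)
den <- colSums(x[cols])

# Divide each column of num by the matching entry of den
pct <- sweep(num, 2, den, "/") * 100
pct

# Long format, one row per (type, column) pair
long <- as.data.frame(as.table(pct), responseName = "value")
```

Here pct is a matrix with one row per type and one column per letter, matching the wide result above.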

Repeating blocks of rows in a data frame based on another value in the data frame

There are a number of questions here about repeating rows a prespecified number of times in R, but I can't find one to address the specific question I'm asking.
I have a dataframe of responses from a survey in which each respondent answers somewhere between 5 and 10 questions. As a toy example:
df <- data.frame(ID = rep(1:2, each = 5),
Response = sample(LETTERS[1:4], 10, replace = TRUE),
Weight = rep(c(2,3), each = 5))
> df
ID Response Weight
1 1 D 2
2 1 C 2
3 1 D 2
4 1 D 2
5 1 B 2
6 2 D 3
7 2 C 3
8 2 B 3
9 2 D 3
10 2 B 3
I would like to repeat respondent 1's answers twice, as a block, and then respondent 2's answers 3 times, as a block, and I want each block of responses to have a unique ID. In other words, I want the end result to look like this:
ID Response Weight
1 11 D 2
2 11 C 2
3 11 D 2
4 11 D 2
5 11 B 2
6 12 D 2
7 12 C 2
8 12 D 2
9 12 D 2
10 12 B 2
11 21 D 3
12 21 C 3
13 21 B 3
14 21 D 3
15 21 B 3
16 22 D 3
17 22 C 3
18 22 B 3
19 22 D 3
20 22 B 3
21 23 D 3
22 23 C 3
23 23 B 3
24 23 D 3
25 23 B 3
The way I'm doing this is currently really clunky, and, given that I have >3000 respondents in my dataset, is unbearably slow.
Here's my code:
df.expanded <- NULL
for(i in unique(df$ID)) {
x <- df[df$ID == i,]
y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
y$order <- rep(1:max(x$Weight), nrow(x))
y <- y[with(y, order(order)),]
y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
df.expanded <- rbind(df.expanded, y)
}
Is there a faster way to do this?
There is an easier solution. I suppose you want to duplicate rows based on Weight as shown in your code.
df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
# sort the rows
df2 <- df2[order(df2$ID), ]
Is this method faster? Let's see:
library(microbenchmark)
microbenchmark(
m1 = {
df.expanded <- NULL
for(i in unique(df$ID)) {
x <- df[df$ID == i,]
y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
y$order <- rep(1:max(x$Weight), nrow(x))
y <- y[with(y, order(order)),]
y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
df.expanded <- rbind(df.expanded, y)
}
},
m2 = {
df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
# sort the rows
df2 <- df2[order(df2$ID), ]
}
)
# Unit: microseconds
# expr min lq mean median uq max neval
# m1 806.295 862.460 1101.6672 921.0690 1283.387 2588.730 100
# m2 171.731 194.199 245.7246 214.3725 283.145 506.184 100
There might be other more efficient ways.
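One thing to watch with pasted IDs: they are character strings, so sorting is lexicographic, and once IDs or weights reach two digits different combinations can collide (ID 1 with copy 11 and ID 11 with copy 1 both give "111"). Below is a minimal sketch that avoids the sort by repeating each respondent's block as a whole. The helper name expand_blocks and the columns Copy and NewID are hypothetical, and the numeric NewID assumes every weight is below 100, as in the question's own code.

```r
df <- data.frame(ID = rep(1:2, each = 5),
                 Response = c("D", "C", "D", "D", "B", "D", "C", "B", "D", "B"),
                 Weight = rep(c(2, 3), each = 5),
                 stringsAsFactors = FALSE)

# Repeat one respondent's rows as a block, Weight times, and number the copies
expand_blocks <- function(d) {
  w <- d$Weight[1]
  out <- d[rep(seq_len(nrow(d)), times = w), ]
  out$Copy <- rep(seq_len(w), each = nrow(d))
  out
}

expanded <- do.call(rbind, lapply(split(df, df$ID), expand_blocks))
rownames(expanded) <- NULL

# Numeric block ID, e.g. 101, 102, 201, ... (assumes Weight < 100)
expanded$NewID <- expanded$ID * 100 + expanded$Copy
```

Keeping Copy as its own column also leaves the original ID intact, which may be convenient for later grouping.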
Another approach would be to use data.table.
Assuming you're starting with "DT" as your data.table, try:
library(data.table)
DT[, list(.id = rep(seq(Weight[1]), each = .N), Weight, Response), .(ID)]
I haven't pasted the ID columns together, but instead, created a secondary column. That seems a little bit more flexible to me.
Data for testing. Change n to create a larger dataset to play with.
set.seed(1)
n <- 5
weights <- sample(3:15, n, TRUE)
df <- data.frame(ID = rep(seq_along(weights), weights),
Response = sample(LETTERS[1:5], sum(weights), TRUE),
Weight = rep(weights, weights))
DT <- as.data.table(df)

In R: dcast in function, pass column names (again!)

Given a df in semi-long format with id variables a and b and measured data in columns m1 and m2. The type of data is specified by the variable v (values var1 and var2).
set.seed(8)
df_l <-
data.frame(
a = rep(sample(LETTERS,5),2),
b = rep(sample(letters,5),2),
v = c(rep("var1",5),rep("var2",5)),
m1 = sample(1:10,10,F),
m2 = sample(20:40,10,F))
Looks as:
a b v m1 m2
1 W r var1 3 40
2 N l var1 6 32
3 R a var1 9 28
4 F g var1 5 21
5 E u var1 4 38
6 W r var2 1 35
7 N l var2 8 33
8 R a var2 10 29
9 F g var2 7 30
10 E u var2 2 23
If I want to make a wide format of values in m1 using id a as rows and values in v as columns I do:
> reshape2::dcast(df_l, a~v, value.var="m1")
a var1 var2
1 E 4 2
2 F 5 7
3 N 6 8
4 R 9 10
5 W 3 1
How do I write a function that does this where the arguments to dcast (row, column and value.var) are supplied as function arguments, something like:
fun <- function(df,row,col,val){
require(reshape2)
res <-
dcast(df, row~col, value.var=val)
return(res)
}
I checked SO here and here to try variations of match.call and eval(substitute()) in order to "get" the arguments inside the function, and also tried with the lazyeval package. No success.
What am I doing wrong here? How do I get dcast to recognize variable names?
Formula argument also accepts character input.
foo <- function(df, id, measure, val) {
dcast(df, paste(paste(id, collapse = " + "), "~",
paste(measure, collapse = " + ")),
value.var = val)
}
require(reshape2)
foo(df_l, "a", "v", "m1")
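An equivalent sketch that makes the formula explicit, assuming reshape2 is installed. The wrapper name cast_wide and the small data frame toy are hypothetical, chosen so the result is easy to check by eye.

```r
library(reshape2)  # assumed to be installed

cast_wide <- function(df, row, col, val) {
  # Build e.g. "a ~ v" from the column names, then convert it to a formula
  f <- as.formula(paste(paste(row, collapse = " + "), "~",
                        paste(col, collapse = " + ")))
  dcast(df, f, value.var = val)
}

toy <- data.frame(a = c("E", "F", "E", "F"),
                  v = c("var1", "var1", "var2", "var2"),
                  m1 = c(4, 5, 2, 7),
                  stringsAsFactors = FALSE)

wide <- cast_wide(toy, "a", "v", "m1")
wide  # one row per value of a, with columns var1 and var2
```

The original fun() fails because row~col is taken literally as a formula over columns named row and col; building the formula from strings sidesteps that.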
Note that data.table's dcast (current development) can also cast multiple value.var columns directly. So, you can also do:
require(data.table) # v1.9.5
foo(setDT(df_l), "a", "v", c("m1", "m2"))
# a m1_var1 m1_var2 m2_var1 m2_var2
# 1: F 1 6 28 21
# 2: H 9 2 38 29
# 3: M 5 10 24 35
# 4: O 8 3 23 26
# 5: T 4 7 31 39

R: colSums when not all columns are numeric

I have the following data frame
Type CA AR
alpha 1 5
beta 4 9
gamma 3 8
I want to get the column and row sums such that it looks like this:
Type CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
I am able to do rowSums (as shown above) I guess because they are all numeric.
colSums(df)
However, when I do colSums I get the error 'x must be numeric.' I realize that this is because the "Type" column is not numeric.
If I do the following code such that I try to print the value into the 4th row (and only the 2nd through 4th columns are summed)
df[,4] = colSums(df[c(2:4)])
Then I get an error that replacement isn't same as data size.
Does anyone know how to work around this? I want to print the column sums for columns 2-4, and leave the 1st column total blank or allow me to print "Total"?
Thanks in advance!!
Check out numcolwise() in the plyr package.
library(plyr)
df <- data.frame(
Type = c("alpha", "beta", "gamma"),
CA = c(1, 4, 3),
AR = c(5, 9, 8)
)
numcolwise(sum)(df)
Result:
CA AR
1 8 22
Use a matrix:
m <- as.matrix(df[,-1])
rownames(m) <- df$Type
# CA AR
# alpha 1 5
# beta 4 9
# gamma 3 8
Then add margins:
addmargins(m,FUN=c(Total=sum),quiet=TRUE)
# CA AR Total
# alpha 1 5 6
# beta 4 9 13
# gamma 3 8 11
# Total 8 22 30
The simpler addmargins(m) also works, but defaults to labeling the margins with "Sum".
You are right, it is because the first column is not numeric.
Try to use the first column as rownames:
df <- data.frame(row.names = c("alpha", "beta", "gamma"), CA = c(1, 4, 3), AR = c(5, 9, 8))
df$Total <- rowSums(df)
df['Total',] <- colSums(df)
df
The output will be:
CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
If you need the word 'Type', just remove the rownames and add the column back:
Type <- rownames(df)
df <- data.frame(Type, df, row.names=NULL)
df
And its output:
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
Use:
df$Total <- df$CA + df$AR
A more general solution:
data$Total <- Reduce('+',data[, sapply(data, is.numeric)])
EDIT: I realize I completely misunderstood the question. You are indeed looking for the sum of rows, and I gave sum of columns.
To do rows instead:
data <- data.frame(x = 1:3, y = 4:6, z = as.character(letters[1:3]))
data$z <- as.character(data$z)
rbind(data,sapply(data, function(y) ifelse(test = is.numeric(y), Reduce('+',y), "Total")))
If you do not know which columns are numeric, but rather want the sums across rows then do this:
df$Total = rowSums( df[ sapply(df, is.numeric)] )
The is.numeric function will return a logical value which is valid for selecting columns and sapply will return the logical values as a vector.
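To make that concrete, here is a small sketch of what the logical mask looks like and how it selects columns. It assumes the question's data frame is rebuilt as below, and the names is_num and col_totals are only illustrative.

```r
df <- data.frame(Type = c("alpha", "beta", "gamma"),
                 CA = c(1, 4, 3),
                 AR = c(5, 9, 8),
                 stringsAsFactors = FALSE)

# One TRUE/FALSE per column: FALSE for Type, TRUE for CA and AR
is_num <- sapply(df, is.numeric)
is_num

# The mask selects only the numeric columns, so both sums work
df$Total <- rowSums(df[is_num])
col_totals <- colSums(df[is_num])
col_totals  # CA = 8, AR = 22
```

Because is_num is computed before Total is added, col_totals covers only CA and AR; recompute the mask if the Total column should be included too.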
To add a set of column totals and a grand total we need to rewind to the point where the dataset was created and prevent the "Type" column from being constructed as a factor:
dat <- read.table(text="Type CA AR
alpha 1 5
beta 4 9
gamma 3 8 ", header=TRUE, stringsAsFactors=FALSE)
dat$Total = rowSums( dat[ sapply(dat, is.numeric)] )
rbind( dat, append(c(Type="Total"),
as.list(colSums( dat[ sapply(dat, is.numeric)] ))))
#----------
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
That's a data.frame:
> str( rbind( dat, append(c(Type="Total"), as.list(colSums( dat[ sapply(dat, is.numeric)] )))) )
'data.frame': 4 obs. of 4 variables:
$ Type : chr "alpha" "beta" "gamma" "Total"
$ CA : num 1 4 3 8
$ AR : num 5 9 8 22
$ Total: num 6 13 11 30
I think this should solve your problem
x<-data.frame(type=c('alpha','beta','gamma'), x=c(1,2,3), y=c(4,5,6))
x[,'Total'] <- rowSums(x[,c(2:3)])
# Append the totals as a one-row data frame so the numeric columns stay numeric
x<-rbind(x, data.frame(type = 'Total', t(colSums(x[,c(2:4)]))))
library(tidyverse)
df <- data.frame(
Type = c("alpha", "beta", "gamma"),
CA = c(1, 4, 3),
AR = c(5, 9, 8)
)
df2 <- colSums(df[, c("CA", "AR")])
# CA AR
# 8 22

Returning above and below rows of specific rows in r dataframe

Consider any dataframe
col1 col2 col3 col4
row.name11 A 23 x y
row.name12 A 29 x y
row.name13 B 17 x y
row.name14 A 77 x y
I have a list of rownames which I want to return from this dataframe. Let's say I have row.name12 and row.name13 in a list. I can easily return these rows from the dataframe. But I also want to return 4 rows above and 4 rows below these rows, which means I want to return row.name8 through row.name17. I think it is similar to grep -A -B in the shell.
Possible solution: is there any way to get a row number from a row name? If I have the row number, then I can easily subtract 4 and add 4 to it and return those rows.
Note: Here rownames are just examples. Rownames could be anything like RED, BLUE, BLACK, etc.
Try that:
extract.with.context <- function(x, rows, after = 0, before = 0) {
match.idx <- which(rownames(x) %in% rows)
span <- seq(from = -before, to = after)
extend.idx <- c(outer(match.idx, span, `+`))
extend.idx <- Filter(function(i) i > 0 & i <= nrow(x), extend.idx)
extend.idx <- sort(unique(extend.idx))
return(x[extend.idx, , drop = FALSE])
}
dat <- data.frame(x = 1:26, row.names = letters)
extract.with.context(dat, c("a", "b", "j", "y"), after = 3, before = 1)
# x
# a 1
# b 2
# c 3
# d 4
# e 5
# i 9
# j 10
# k 11
# l 12
# m 13
# x 24
# y 25
# z 26
Perhaps a combination of which() and %in% would help you:
dat[which(rownames(dat) %in% c("row.name13")) + c(-1, 1), ]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name14 A 77 x y
In the above, we are trying to identify which row names in "dat" are "row.name13" (using which()), and the + c(-1, 1) tells R to return the row before and the row after. If you wanted to include the row, you could do something like + c(-1:1).
To get the range of rows, switch the comma to a colon:
dat[which(rownames(dat) %in% c("row.name13")) + c(-1:1), ]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name13 B 17 x y
# row.name14 A 77 x y
Update
Matching a list is a little bit trickier, but without thinking about it too much, here is a possibility:
myRows <- c("row.name12", "row.name13")
rowRanges <- lapply(which(rownames(dat) %in% myRows), function(x) x + c(-1:1))
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 2 3 4
#
lapply(rowRanges, function(x) dat[x, ])
# [[1]]
# col1 col2 col3 col4
# row.name11 A 23 x y
# row.name12 A 29 x y
# row.name13 B 17 x y
#
# [[2]]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name13 B 17 x y
# row.name14 A 77 x y
This outputs a list of data.frames which might be handy since you might have duplicated rows (as there are in this example).
Update 2: Using grep if it is more appropriate
Here is a variation of your question, one which would be less convenient to solve using the which()...%in% approach.
set.seed(1)
dat1 <- data.frame(ID = 1:25, V1 = sample(100, 25, replace = TRUE))
rownames(dat1) <- paste("rowname", sample(apply(combn(LETTERS[1:4], 2),
2, paste, collapse = ""),
25, replace = TRUE),
sprintf("%02d", 1:25), sep = ".")
head(dat1)
# ID V1
# rowname.AD.01 1 27
# rowname.AB.02 2 38
# rowname.AD.03 3 58
# rowname.CD.04 4 91
# rowname.AD.05 5 21
# rowname.AD.06 6 90
Now, imagine you wanted to identify the rows with AB and AC, but you don't have a list of the numeric suffixes.
Here's a little function that can be used in such a scenario. It borrows a little from @Spacedman to make sure that the rows returned are within the range of the data (as per @flodel's suggestion).
getMyRows <- function(data, matches, range) {
rowMatches = lapply(unlist(lapply(matches, function(x)
grep(x, rownames(data)))), function(y) y + range)
rowMatches = lapply(rowMatches, function(x) x[x > 0 & x <= nrow(data)])
lapply(rowMatches, function(x) data[x, ])
}
You can use it as follows (but I won't print the results here). First, specify the dataset, then the pattern(s) you want matched, then the range (in this example, three rows before and four rows after).
getMyRows(dat1, c("AB", "AC"), -3:4)
Applying it to the earlier example of matching row.name12 and row.name13, you can use it as follows: getMyRows(dat, c(12, 13), -1:1).
You can also modify the function to make it more general (for example, to specify matching with a column instead of row names).
Create some sample data:
> dat=data.frame(col1=letters,col2=sample(26),col3=sample(letters))
> dat
col1 col2 col3
1 a 26 x
2 b 12 i
3 c 15 v
...
Set our target vector (note I choose an edge case and overlapping cases), and find matching rows:
> target=c("a","e","g","s")
> match = which(dat$col1 %in% target)
Create sequences from -2 to +2 of the matches (adjust for your needs) and merge:
> getThese = unique(as.vector(mapply(seq,match-2,match+2)))
> getThese
[1] -1 0 1 2 3 4 5 6 7 8 9 17 18 19 20 21
Fix the edge cases:
> getThese = getThese[getThese > 0 & getThese <= nrow(dat)]
> dat[getThese,]
col1 col2 col3
1 a 26 x
2 b 12 i
3 c 15 v
4 d 22 d
5 e 2 j
6 f 9 l
7 g 1 w
8 h 21 n
9 i 17 p
17 q 18 a
18 r 10 m
19 s 24 o
20 t 13 e
21 u 3 k
>
Remember our targets were a, e, g and s. You've now got those plus two rows above and two rows below for each, with no duplicates.
If you are using row names, just create 'match' from those. I was using a column.
I'd write a bunch more tests using the testthat package if this were my problem.
Another option will be to use filter. In case stats::filter is masked e.g. by dplyr::filter you have to use stats::filter.
dat <- data.frame(x = seq_along(letters), row.names = letters)
i <- rownames(dat) %in% c("a", "b", "j", "y") #Get the matches
nAfter <- 3
nBefore <- 1
fi <- seq(-nBefore, nAfter)
n <- max(abs(fi))
fi <- seq(-n, n) %in% fi
dat[head(tail(filter(c(rep(FALSE, n), i, rep(FALSE, n)), fi), -n), -n) > 0,, drop = FALSE]
# x
#a 1
#b 2
#c 3
#d 4
#e 5
#i 9
#j 10
#k 11
#l 12
#m 13
#x 24
#y 25
#z 26
I would simply proceed as follows:
dat[(grep("row.name12",row.names(dat))-4):(grep("row.name13",row.names(dat))+4),]
grep("row.name12",row.names(dat)) gives you the row number of the row named "row.name12", so
(grep("row.name12",row.names(dat))-4):(grep("row.name13",row.names(dat))+4)
gives you a series of row numbers ranging from the 4th row before the row named "row.name12" to the 4th row after the one named "row.name13".
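Two caveats with this range: grep() does pattern matching, so "row.name12" would also match a name such as "row.name120", and the computed range can run past the first or last row of the data. A minimal sketch that uses exact matching with match() and clamps the window, assuming the four-row example from the question (the names wanted, pad, first, last and context are illustrative):

```r
dat <- data.frame(col1 = c("A", "A", "B", "A"),
                  col2 = c(23, 29, 17, 77),
                  row.names = paste0("row.name", 11:14))

wanted <- c("row.name12", "row.name13")
pad <- 4  # rows of context on each side

# match() returns the position of each exact row name (NA if absent)
idx <- match(wanted, rownames(dat))

# Clamp the window to the valid row range before subsetting
first <- max(1, min(idx) - pad)
last  <- min(nrow(dat), max(idx) + pad)
context <- dat[first:last, , drop = FALSE]
context
```

With only four rows available, the window is clamped to rows 1 through 4 instead of failing on out-of-range indices.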

Resources