Conditional calculations in R - r

I have a dataframe with categories and values. Based on the category I want to subtract values that are stored in another table.
myframe <- data.frame(
x = factor(c("A", "D", "A", "C")),
y = c(8, 3, 9, 9))
reference <- c('A'= 1, 'B'= 2, 'C'= 3, 'D'= 4)
The desired (y-ref) outcome would be:
result <- data.frame(
x = factor(c("A", "D", "A", "C")),
y = c(8, 3, 9, 9),
r = c(7, -1, 8, 6))
x y r
1 A 8 7
2 D 3 -1
3 A 9 8
4 C 9 6
The reference 'table' is a named vector in this case but it could be changed to a better suited data format.
I am not sure how to accomplish this...

This is a fairly straightforward task using match and [...
myframe$r <- myframe$y - reference[ match( myframe$x , names( reference ) ) ]
# x y r
#1 A 8 7
#2 D 3 -1
#3 A 9 8
#4 C 9 6
Pretty sure this is a (several-times over) duplicate so we should find you a good pointer and close the question (but I commend you for showing input data and desired result, many questions are often not that well laid out).
EDIT
Well there are many, many match based questions on the site. It's hard to pick one to point to as an exact duplicate. But I suggest having a browse of a few of these by searching for "r match" (you can search by specific tags by enclosing the search term in square brackets like this "[r]").
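As a small sketch of the same lookup, a named vector can also be indexed by name directly. The one thing to watch (and the reason for the as.character call below) is that indexing with a factor uses its integer codes rather than its labels. The data are copied from the question; the names by_name and by_match are only illustrative.

```r
myframe <- data.frame(
  x = factor(c("A", "D", "A", "C")),
  y = c(8, 3, 9, 9))
reference <- c(A = 1, B = 2, C = 3, D = 4)

# Index by label; as.character() is needed because a factor would be
# used as integer codes (levels A, C, D -> 1, 2, 3), not as names
by_name <- unname(reference[as.character(myframe$x)])

# Same lookup via match(), as in the answer above
by_match <- unname(reference[match(myframe$x, names(reference))])

myframe$r <- myframe$y - by_name
print(myframe)
```

Both lookups should agree; the match version has the advantage that it does not depend on remembering the as.character conversion.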

The data.table way:
library(data.table)
# convert to data.table and set key for the upcoming merge
dt = data.table(myframe, key = 'x')
ref = data.table(x = names(reference), val = reference)
# merge and add a new column
dt[ref, r := y - val]
dt
# x y r
#1: A 8 7
#2: A 9 8
#3: C 9 6
#4: D 3 -1
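For comparison, the same join can be sketched in base R with merge, assuming the reference vector is first turned into a two-column lookup table. The names ref_df and merged are made up for this illustration. As with the keyed data.table result, the row order of the output may differ from the input.

```r
myframe <- data.frame(
  x = factor(c("A", "D", "A", "C")),
  y = c(8, 3, 9, 9))
reference <- c(A = 1, B = 2, C = 3, D = 4)

# Turn the named vector into a lookup table with one row per category
ref_df <- data.frame(x = names(reference), val = unname(reference),
                     stringsAsFactors = FALSE)

# Left join: keep every row of myframe and attach the matching val
merged <- merge(myframe, ref_df, by = "x", all.x = TRUE)
merged$r <- merged$y - merged$val
merged$val <- NULL
print(merged)
```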

Related

How to store vector in dataframe in R

I am trying to create a dataframe through a for loop and trying to add a vector for each row of the data frame.
The rows are people and the columns are the name category and points category.
For example I'm trying to have something like...
Name Points
Susie c(12,45,23)
Bill c(13,24,12,89)
CJ c(12)
So far my code looks like
names_list <-c("Susie","Bill","CJ")
result = data.frame()
for (name in names_list){
listing = .....
frame = data.frame(name,listing)
names(frame) = c("name","list")
result <- rbind(result,frame)
}
Where listing happens to be the points associated with that name. However instead of creating 1 row for each name containing all their points, it creates multiple rows with the same name for each point.
Result looks like
1 Susie 12
2 Susie 45
3 Susie 23
4 Bill 13
5 Bill 24
6 Bill 12
7 Bill 89
8 CJ 12
The specific problem you've encountered is due to data.frame flattening any list inputs. This can be prevented using the I function, which marks an object to be kept "as is". For example,
data.frame(a = 1, b = list(c("a", "b")))
doesn't do what you want, but
data.frame(a = 1, b = I(list(c("a", "b"))))
does. A discussion of this behavior and some alternatives are available at http://r4ds.had.co.nz/many-models.html#list-columns-1
You can use I to produce the desired result using your example as well:
names_list <-c("Susie","Bill","CJ")
points <- list(c(12,45,23),
c(13,24,12,89),
12)
result = data.frame()
for (i in seq_along(names_list)){
frame = data.frame(names_list[[i]], I(points[i]))
names(frame) = c("name","list")
result <- rbind(result,frame)
}
though as pointed out in the comments, there are better ways to do it. All you really need is
data.frame(
name = names_list,
points = I(points))
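A quick way to check that this really gives one row per person is to inspect the list column afterwards. This is only a sketch using the example data from above; lengths and sapply are base R.

```r
names_list <- c("Susie", "Bill", "CJ")
points <- list(c(12, 45, 23),
               c(13, 24, 12, 89),
               12)

# I() keeps the list as a single column instead of spreading it out
result <- data.frame(name = names_list, points = I(points))

n_rows <- nrow(result)                    # one row per name
n_points <- lengths(result$points)        # number of points per person
totals <- sapply(result$points, sum)      # work on each stored vector
print(result)
```

Each cell of the points column is an ordinary numeric vector, so functions such as sum can be applied per row with sapply or lapply.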
I don't know what structure your vector values are in, but in general to nest vectors in a column you can do something like this:
names <- c("A", "B", "C")
vectors <- list(list(1,2,3), list(4,5,6), list(7,8,9))
as.data.frame(cbind(names, vectors))
names vectors
1 A 1, 2, 3
2 B 4, 5, 6
3 C 7, 8, 9
Name <- c("Suzie", "Bill", "CJ")
Points <- list(c( 12,45,23),
c(13,24,12,89),
c( 12)
)
result <- as.data.frame(cbind(Name, Points))
Then:
print(result)
Gives:
> result
Name Points
1 Suzie 12, 45, 23
2 Bill 13, 24, 12, 89
3 CJ 12
Note:
print(result$Points)
Gives:
> result$Points
[[1]]
[1] 12 45 23
[[2]]
[1] 13 24 12 89
[[3]]
[1] 12

Using a custom summary function for factors within multiple columns

I conducted a survey with a large number of items, each of which has distinct categorical response options stored as factors. I need to summarize these columns in an efficient manner, preferably with functionality like that provided by forcats::fct_count(). I also need to know how many non-NA responses were provided for each variable, since different items were shown to different respondents. I wrote a function to make a tidy little summary data frame, but am struggling to efficiently run this function along each column and then combine the results into a single object (à la ddply).
I've tried sapply(), gather()-ing the data to long format and then running ddply(), but the problem of the distinct levels for each variable seems to keep getting in the way. See below for a reproducible example of the data set and my summarizing function. I could run the function for each variable (as shown below), but I know there's gotta be a more efficient way to do this that doesn't involve creating a ton of individual summary data-frame objects. Thanks for any help you can provide.
data <- data.frame(
ID = c(1:50),
X = as.factor(sample(c("yes", "no", NA), 50, replace = TRUE)),
Y = as.factor(sample(c("a", "b", "c", NA), 50, replace = TRUE)),
Z = as.factor(sample(c("d", "e", "f", "g", "h", NA), 50, replace = TRUE))
)
library(tidyverse)
library(forcats)
factorsummaries.f <- function(x) {
x <- na.omit(x)
counts <- fct_count(fct_drop(x), sort = TRUE)
counts$f <- as.character(counts$f)
total <- data.frame(f = "sum", n = as.numeric(sum(counts$n)))
return(bind_rows(counts, total))
}
factorsummaries.f(data$X)
factorsummaries.f(data$Y)
Perhaps you are looking for purrr::map_dfr
map_dfr(data[,2:ncol(data)], factorsummaries.f, .id = "colname")
#output
colname f n
<chr> <chr> <dbl>
1 X no 18
2 X yes 17
3 X sum 35
4 Y a 14
5 Y c 13
6 Y b 12
7 Y sum 39
8 Z g 10
9 Z d 9
10 Z h 8
11 Z f 6
12 Z e 5
13 Z sum 38
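If the tidyverse is not available, roughly the same result can be sketched in base R. This is an assumption-laden illustration, not the approach from the answer: summarise_factor is a hypothetical stand-in for factorsummaries.f, and the small survey data frame is made up so that the counts are deterministic.

```r
# Hypothetical helper: counts per observed level plus a final "sum" row
summarise_factor <- function(x) {
  x <- droplevels(x[!is.na(x)])
  counts <- sort(table(x), decreasing = TRUE)
  data.frame(f = c(names(counts), "sum"),
             n = c(as.numeric(counts), length(x)),
             stringsAsFactors = FALSE)
}

# Made-up example data with different levels per column
survey <- data.frame(
  ID = 1:4,
  X = factor(c("yes", "no", "yes", NA)),
  Y = factor(c("a", "a", "b", NA)))

# Summarise every column except ID, then stack the pieces,
# tagging each piece with the name of the column it came from
pieces <- lapply(survey[-1], summarise_factor)
combined <- do.call(rbind,
                    Map(function(nm, piece) cbind(colname = nm, piece),
                        names(pieces), pieces))
rownames(combined) <- NULL
print(combined)
```

Converting the levels to character before stacking is what sidesteps the clash between the distinct level sets of each variable.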

Fast melted data.table operations

I am looking for patterns for manipulating data.table objects whose structure resembles that of dataframes created with melt from the reshape2 package. I am dealing with data tables with millions of rows. Performance is critical.
The generalized form of the question is whether there is a way to perform grouping based on a subset of values in a column and have the result of the grouping operation create one or more new columns.
A specific form of the question could be how to use data.table to accomplish the equivalent of what dcast does in the following:
input <- data.table(
id=c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
variable=c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
value=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
dcast(input,
id ~ variable, sum,
subset=.(variable %in% c('x', 'y')))
the output of which is
id x y
1 1 1 5
2 2 4 11
3 3 15 9
Quick untested answer: seems like you're looking for by-without-by, a.k.a. grouping-by-i :
setkey(input,variable)
input[c("x","y"),sum(value)]
This is like a fast HAVING in SQL. j gets evaluated for each row of i. In other words, the above is the same result but much faster than :
input[,sum(value),keyby=variable][c("x","y")]
The latter subsets and evals for all the groups (wastefully) before selecting only the groups of interest. The former (by-without-by) goes straight to the subset of groups only.
The group results will be returned in long format, as always. But reshaping to wide afterwards on the (relatively small) aggregated data should be relatively instant. That's the thinking anyway.
The first setkey(input,variable) might bite if input has a lot of columns not of interest. If so, it might be worth subsetting the columns needed :
DT = setkey(input[ , c("variable","value")], variable)
DT[c("x","y"),sum(value)]
In future when secondary keys are implemented that would be easier :
set2key(input,variable) # add a secondary key
input[c("x","y"),sum(value),key=2] # syntax speculative
To group by id as well :
setkey(input,variable)
input[c("x","y"),sum(value),by='variable,id']
and including id in the key might be worth setkey's cost depending on your data :
setkey(input,variable,id)
input[c("x","y"),sum(value),by='variable,id']
If you combine a by-without-by with by, as above, the by-without-by operates just like a subset; i.e., j is only run for each row of i when by is missing (hence the name by-without-by). So you need to include variable, again, in the by as shown above.
Alternatively, the following should group by id over the union of "x" and "y" instead (but the above is what you asked for in the question, iiuc) :
input[c("x","y"),sum(value),by=id]
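A note for readers on current data.table releases: since version 1.9.4 the implicit by-without-by behaviour described above has to be requested explicitly with by = .EACHI; without it, j is evaluated once over all joined rows. The sketch below assumes a data.table version that supports the on argument (1.9.6 or later), which avoids the need for setkey. The result names per_variable and per_id are illustrative only.

```r
library(data.table)

input <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# Evaluate j once per row of i, i.e. once for "x" and once for "y"
per_variable <- input[.(c("x", "y")), on = "variable",
                      .(total = sum(value)), by = .EACHI]

# Use i purely as a subset and group the joined rows by variable and id
per_id <- input[.(c("x", "y")), on = "variable",
                .(total = sum(value)), by = .(variable, id)]

print(per_variable)
print(per_id)
```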
> setkey(input, "id")
> input[ , list(sum(value)), by=id]
id V1
1: 1 6
2: 2 15
3: 3 34
> input[ variable %in% c("x", "y"), list(sum(value)), by=id]
id V1
1: 1 6
2: 2 15
3: 3 24
The last one:
> input[ variable %in% c("x", "y"), list(sum(value)), by=list(id, variable)]
id variable V1
1: 1 x 1
2: 1 y 5
3: 2 x 4
4: 2 y 11
5: 3 x 15
6: 3 y 9
I'm not sure if this is the best way, but you can try:
input[, list(x = sum(value[variable == "x"]),
y = sum(value[variable == "y"])), by = "id"]
# id x y
# 1: 1 1 5
# 2: 2 4 11
# 3: 3 15 9
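The data.table package also provides its own dcast method for data.table objects, so the wide result from the question can be sketched by filtering in i first and casting afterwards. This assumes a data.table version that ships dcast; the name wide is illustrative.

```r
library(data.table)

input <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c("x", "y", "y", "x", "y", "y", "x", "x", "y", "other"),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# Keep only the variables of interest, then cast to one column each,
# summing the values that fall into the same id/variable cell
wide <- dcast(input[variable %in% c("x", "y")],
              id ~ variable,
              fun.aggregate = sum,
              value.var = "value")
print(wide)
```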

R repeat elements of data frame

I have searched the internet, but I haven't been able to find a solution to my problem.
I have a data frame of numbers and characters:
mydf <- data.frame(col1=c(1, 2, 3, 4),
col2 = c(5, 6, 7, 8),
col3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
mydf:
col1 col2 col3
1 5 a
2 6 b
3 7 c
4 8 d
I would like to repeat this into
col1 col2 col3
1 5 a
1 5 a
1 5 a
2 6 b
2 6 b
2 6 b
3 7 c
3 7 c
3 7 c
4 8 d
4 8 d
4 8 d
Using apply(mydf, 2, function(x) rep(x, each = 3)) will give the right repetition, but will not conserve the classes of col1, col2, and col3, as numeric, numeric and character, respectively, as I would like. This is a constructed example, and setting the classes of each column in my data frame is a bit tedious.
Is there a way to make the repetition while conserving the classes?
It's even easier than you think.
index <- rep(seq_len(nrow(mydf)), each = 3)
mydf[index, ]
This also avoids the implicit looping from apply.
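To see why row indexing keeps the classes while apply does not, here is a short sketch on the question's data. The point is that apply first converts the data frame with as.matrix, and a matrix can hold only one type, so mixed numeric and character columns all become character. The variable names repeated, col_classes, and via_apply are illustrative.

```r
mydf <- data.frame(col1 = c(1, 2, 3, 4),
                   col2 = c(5, 6, 7, 8),
                   col3 = c("a", "b", "c", "d"),
                   stringsAsFactors = FALSE)

# Repeat each row index three times and subset by it
index <- rep(seq_len(nrow(mydf)), each = 3)
repeated <- mydf[index, ]
rownames(repeated) <- NULL   # drop the 1, 1.1, 1.2, ... row names

# Column classes survive because each column is subset on its own
col_classes <- sapply(repeated, class)

# apply() works on as.matrix(mydf), which is a character matrix here
via_apply <- apply(mydf, 2, function(x) rep(x, each = 3))
print(col_classes)
```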
This is an unfortunate and unexpected class conversion (to me, anyway). Here's an easy workaround that uses the fact that a data.frame is just a special list.
data.frame(lapply(mydf, function(x) rep(x, each = 3)))
(anyone know why the behaviour the questioner observed shouldn't be reported as a bug?)
Just another solution:
mydf3 <- do.call(rbind, rep(list(mydf), 3))
Take a look at aggregate and disaggregate in the raster package. Or, use my modified version zexpand below:
# zexpand: analogous to disaggregate
zexpand<-function(inarray, fact=2, interp=FALSE, ...) {
# do same analysis of fact to allow one or two values, fact >=1 required, etc.
fact<-as.integer(round(fact))
switch(as.character(length(fact)),
'1' = xfact<-yfact<-fact,
'2'= {xfact<-fact[1]; yfact<-fact[2]},
{xfact<-fact[1]; yfact<-fact[2];warning(' fact is too long. First two values used.')})
if (xfact < 1) { stop('fact[1] must be > 0') }
if (yfact < 1) { stop('fact[2] must be > 0') }
bigtmp <- matrix(rep(t(inarray), each=xfact), nrow(inarray), ncol(inarray)*xfact, byrow=TRUE) #does column expansion
bigx <- t(matrix(rep((bigtmp),each=yfact),ncol(bigtmp),nrow(bigtmp)*yfact,byrow=TRUE))
# the interpolation would go here. Or use interp.loess on output (won't
# handle complex data). Also, look at fields::Tps which probably does
# a much better job anyway. Just do separately on Re and Im data
return(invisible(bigx))
}
I really like Richie Cotton's answer.
But you could also simply use rbind and reorder it.
res <-rbind(mydf,mydf,mydf)
res[order(res[,1],res[,2],res[,3]),]
The package mefa comes with a nice wrapper for rep applied to data.frame. This will match your example in one line:
mefa:::rep.data.frame(mydf, each=3)

Simple data-manipulation in R

@Aniko points out that one way to view my problem is that I need to find the connected components of a graph, where the vertices are the groups and the variables group and nominated_group indicate an edge between those two groups. My goal is to create a variable parent_group which indexes the connected components. Or as I put it before:
I have a dataframe with four variables: ID, group, nominated_ID, and nominated_group.
Consider sister-groups: Groups A and B are sister-groups if there is at least one case in the data where group==A and nominated_group==B, or vice versa.
I would like to create a variable parent_group which takes on a unique value for each set of sister-groups. In other words, no nominations should occur between cases in different parent_groups. Making the parent_group sequential numbers seems like a good idea.
Many thanks for the help I already received here! I can't really contribute here but note that I try to pay it forward at stats.exchange and on wikipedia.
In my fake data, A and B are sister-groups. Either row 4 (ID=4) or row 5 (ID=3) is sufficient to make this true. Each group is also its own sister-group. The goal, the creation of parent_group, should result in one parent_group for all cases in A or B, and another parent_group for group C.
df <- data.frame(ID = c(9, 5, 2, 4, 3, 7),
group = c("A", "A", "B", "B", "A", "C"),
nominated_ID = c(9, 8, 4, 9, 2, 7) )
df$nominated_group <- with(df, group[match(nominated_ID, ID)])
df
ID group nominated_ID nominated_group
1 9 A 9 A
2 5 A 8 <NA>
3 2 B 4 B
4 4 B 9 A
5 3 A 2 B
6 7 C 7 C
Consider a graph with the groups as its vertices and the edges indicating that the two groups occur for the same ID. Then I think you are looking for connected components of this graph. The following is a quick and dirty (and probably not optimal) implementation of this idea using the graph package:
library(graph)
#make some fake data
nom <- data.frame(group = c("A","A","A","B","B","C","C"),
group2 = c("A","A","B","B","A","C","C"),
stringsAsFactors=FALSE)
#remove duplicated pairs
#it will keep A-B distinct from B-A, could probably be fixed
nom1 <- nom[!duplicated(nom),]
#define empty graph
grps <- union(unique(nom$group), unique(nom$group2))
gg <- new("graphNEL", nodes=grps, edgeL=list())
#add an edge for every pair
for (i in 1:nrow(nom1)) gg <- addEdge(nom1$group[i], nom1$group2[i], gg, 1)
#find connected components
cc <- connComp(gg)
#assign parent by matching within cc
nom$parent <- apply(nom, 1,
function(x) which(sapply(cc, function(y) x["group"] %in% y)))
nom
group group2 parent
1 A A 1
2 A A 1
3 A B 1
4 B B 1
5 B A 1
6 C C 2
7 C C 2
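If installing the graph package is not an option, the connected components can also be found with a few lines of base R. The sketch below is not the method from the answer above: it swaps in a simple label-propagation loop, where every group starts with its own label and each nomination pulls both groups to the smaller of their two labels until nothing changes. It uses the data frame from the question; the names label and edges are illustrative.

```r
df <- data.frame(ID = c(9, 5, 2, 4, 3, 7),
                 group = c("A", "A", "B", "B", "A", "C"),
                 nominated_ID = c(9, 8, 4, 9, 2, 7),
                 stringsAsFactors = FALSE)
df$nominated_group <- df$group[match(df$nominated_ID, df$ID)]

# Every group starts out as its own component
groups <- sort(unique(df$group))
label <- setNames(seq_along(groups), groups)

# Edges are the nominations whose target group is known
edges <- df[!is.na(df$nominated_group), c("group", "nominated_group")]

# Propagate the smaller label along every edge until a full pass
# over the edges leaves the labels unchanged
repeat {
  old <- label
  for (k in seq_len(nrow(edges))) {
    pair <- c(edges$group[k], edges$nominated_group[k])
    label[pair] <- min(label[pair])
  }
  if (identical(old, label)) break
}

# Renumber the surviving labels as 1, 2, ... and attach them to the cases
df$parent_group <- match(label[df$group], sort(unique(label)))
print(df)
```

This is quadratic in the worst case, so for large data a dedicated graph library is likely the better choice; for a handful of groups the loop finishes in a couple of passes.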
