Say I wanted to work with hospital Medicare data showing procedure prices by hospital and by county, and my data frame was called df with columns price, procedure and county. If I wanted to find the minimum and maximum prices for each procedure by county, I could do something like
library(plyr)
mostexpensive <- ddply(df, c('county', 'procedure'), function(x) x[which(x$price == max(x$price)), ])
to get a table showing the hospitals with the most expensive procedures in each county. I can then see how many times each hospital is listed with
summary(mostexpensive$hospital)
For the final step I want to add a column to the original df dataframe that says TRUE if the row is the most expensive and FALSE otherwise, but I can't figure out how to get a logical vector from a plyr function. Thanks.
Posting reproducible code would be useful. Try this anyway,
For the summary
pricey <- ddply(df, c('county','procedure'), summarise, most = max(price), less = min(price))
and for the logical indexing
testing <- ddply(df, c('county','procedure'), mutate, expensive = price == max(price))
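For reference, here is a sketch of the same grouped flag in dplyr, which has largely superseded plyr (untested here; load dplyr after plyr, or use dplyr:: prefixes, so that mutate() isn't masked by plyr's version):
library(dplyr)
df %>%
  group_by(county, procedure) %>%
  mutate(expensive = price == max(price)) %>%
  ungroup()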
It will be easier to get an answer if you post a reproducible example. Keep that in mind the next time you ask for help on SO.
That being said, you can use the transform function to add a new column to your existing data.
The first step is to create a toy data set.
set.seed(123)
df <- data.frame(
county = sample(LETTERS[1:3], size = 20, replace = TRUE),
procedure = sample(c(1, 2), size = 20, replace = TRUE),
price = rpois(20, 10)
)
str(df)
## 'data.frame': 20 obs. of 3 variables:
## $ county : Factor w/ 3 levels "A","B","C": 1 3 2 3 3 1 2 3 2 2 ...
## $ procedure: num 2 2 2 2 2 2 2 2 1 1 ...
## $ price : int 6 8 6 8 4 6 6 8 5 12 ...
Now we can use plyr and the transform function
require(plyr)
expensive <- ddply(df, .(county, procedure),
transform, ismax = price == max(price))
expensive
## county procedure price ismax
## 1 A 1 9 FALSE
## 2 A 1 7 FALSE
## 3 A 1 12 TRUE
## 4 A 2 6 FALSE
## 5 A 2 6 FALSE
## 6 A 2 8 TRUE
## 7 B 1 5 FALSE
## 8 B 1 12 TRUE
## 9 B 2 6 FALSE
## 10 B 2 6 FALSE
## 11 B 2 12 TRUE
## 12 B 2 11 FALSE
## 13 C 1 9 TRUE
## 14 C 1 9 TRUE
## 15 C 2 8 FALSE
## 16 C 2 8 FALSE
## 17 C 2 4 FALSE
## 18 C 2 8 FALSE
## 19 C 2 12 TRUE
## 20 C 2 12 TRUE
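To pull out just the most expensive rows afterwards, you can filter on the new flag; a small usage sketch:
subset(expensive, ismax)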
Here is my sample dataset:
mydata = data.frame(ID = c(1,2,3,4,5),
subject = c("His","Geo","Geo","His","Geo"),
age = c(21,24,26,23,26))
I would like to add a row at the top. I would like it to say "School 1" in the ID column while all other columns remain blank. The following is what I am looking for:
mydata = data.frame (ID =c("School 1",1,2,3,4,5),
subject = c(NA,"His","Geo","Geo","His","Geo"),
age = c(NA,21,24,26,23,26))
I have tried the following, but it ends up populating the value across all columns:
mydata <- rbind(c("School 1"), mydata)
I know the following code will get me what I want, but I would like to avoid having to list out NA's as my dataset has tons of columns
mydata <- rbind(c("School 1", NA,NA), mydata)
Any help is appreciated!
A possible solution, based on dplyr. We first need to convert ID from numeric to character.
library(dplyr)
mydata %>%
mutate(ID = as.character(ID)) %>%
bind_rows(list(ID = "School 1"), .)
#> # A tibble: 6 × 3
#> ID subject age
#> <chr> <chr> <dbl>
#> 1 School 1 <NA> NA
#> 2 1 His 21
#> 3 2 Geo 24
#> 4 3 Geo 26
#> 5 4 His 23
#> 6 5 Geo 26
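As an alternative sketch, tibble's add_row() inserts the new row directly at the top via its .before argument (this assumes the tibble package is available; ID still has to be converted to character first):
library(dplyr)
library(tibble)
mydata %>%
  mutate(ID = as.character(ID)) %>%
  add_row(ID = "School 1", .before = 1)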
Using `length<-`, which pads a vector with NA up to a specified total length, you can create a vector of length exactly ncol(mydata) whose first element is 'School 1', and then rbind.
rbind(`length<-`("School 1", ncol(mydata)), mydata)
# ID subject age
# 1 School 1 <NA> NA
# 2 1 His 21
# 3 2 Geo 24
# 4 3 Geo 26
# 5 4 His 23
# 6 5 Geo 26
Explanation
It may be worth thinking about what a data frame actually is in order to better understand the OP's problem. A data frame is really a modified list,
typeof(mydata)
# [1] "list"
whose elements are vectors of equal length, as we can see when we unclass it.
unclass(mydata)
# $ID
# [1] 1 2 3 4 5
#
# $subject
# [1] "His" "Geo" "Geo" "His" "Geo"
#
# $age
# [1] 21 24 26 23 26
#
# attr(,"row.names")
# [1] 1 2 3 4 5
We may easily add other elements to a list,
c(mydata, foo='something')
# $ID
# [1] 1 2 3 4 5
#
# $subject
# [1] "His" "Geo" "Geo" "His" "Geo"
#
# $age
# [1] 21 24 26 23 26
#
# $foo
# [1] "something"
but when we make a data frame out of it, the value gets recycled to the number of rows if nothing longer is provided (which is actually very useful).
as.data.frame(c(mydata, foo='something'))
# ID subject age foo
# 1 1 His 21 something
# 2 2 Geo 24 something
# 3 3 Geo 26 something
# 4 4 His 23 something
# 5 5 Geo 26 something
cbind behaves in exactly the same way.
cbind(mydata, foo='something')
# ID subject age foo
# 1 1 His 21 something
# 2 2 Geo 24 something
# 3 3 Geo 26 something
# 4 4 His 23 something
# 5 5 Geo 26 something
If we provide a vector of the appropriate length (i.e. a full column for the data frame), R has no reason to recycle.
as.data.frame(c(mydata, list(foo=c('something', rep(NA, 4)))))
# ID subject age foo
# 1 1 His 21 something
# 2 2 Geo 24 <NA>
# 3 3 Geo 26 <NA>
# 4 4 His 23 <NA>
# 5 5 Geo 26 <NA>
cbind(mydata, foo=c('something', rep(NA, 4)))
# ID subject age foo
# 1 1 His 21 something
# 2 2 Geo 24 <NA>
# 3 3 Geo 26 <NA>
# 4 4 His 23 <NA>
# 5 5 Geo 26 <NA>
Adding rows is slightly different. As we can easily see in the unclassed data frame above, we effectively need to append a value to each column vector at the desired position. It goes against the grain, so to speak. This is also computationally more expensive, and thus much slower.
as.data.frame(Map(append, mydata, values=c('something', rep(NA, 2)), after=0))
# ID subject age
# 1 something <NA> <NA>
# 2 1 His 21
# 3 2 Geo 24
# 4 3 Geo 26
# 5 4 His 23
# 6 5 Geo 26
Notice that appending a vector shorter than ncol(mydata) also results in recycling, which is exactly what the OP experienced.
as.data.frame(Map(append, mydata, values='something', after=0))
# ID subject age
# 1 something something something
# 2 1 His 21
# 3 2 Geo 24
# 4 3 Geo 26
# 5 4 His 23
# 6 5 Geo 26
R's rbind already takes care of all this in C code, which is fast, and we don't need to Map over anything;
rbind(c("something", NA, NA), mydata)
To avoid typing endless NAs, we can thus use the proposed solution:
rbind(`length<-`("School 1", ncol(mydata)), mydata)
# ID subject age
# 1 School 1 <NA> NA
# 2 1 His 21
# 3 2 Geo 24
# 4 3 Geo 26
# 5 4 His 23
# 6 5 Geo 26
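If you want to check the claim above that appending rows is slower than adding columns, here is a quick and unscientific sketch with the microbenchmark package (assumed installed) on this toy data:
library(microbenchmark)
microbenchmark(
  cols = cbind(mydata, foo = c("something", rep(NA, 4))),
  rows = rbind(`length<-`("School 1", ncol(mydata)), mydata)
)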
I have a dataset in which time is represented as spells (i.e. from time 1 to time 2), like this:
d <- data.frame(id = c("A","A","B","B","C","C"),
t1 = c(1,3,1,3,1,3),
t2 = c(2,4,2,4,2,4),
value = 1:6)
I want to reshape this into a panel dataset, i.e. one row for each unit and time period, like this:
result <- data.frame(id = c("A","A","A","A","B","B","B","B","C","C","C","C"),
t= c(1:4,1:4,1:4),
value = c(1,1,2,2,3,3,4,4,5,5,6,6))
I am attempting to do this with tidyr and gather but not getting the desired result. I am trying something like this which is clearly wrong:
gather(d, 't1', 't2', key=t)
In the actual dataset the spells are irregular.
You were almost there.
Code
library(dplyr)
library(tidyr)

d %>%
# Gather the needed variables. Explanation:
# t_type: What shall we call the column that holds the former
# variable names?
# t: What shall we call the column that holds the values of
# those variables?
# -id,
# -value: Which columns should stay the same and NOT be gathered
# under t_type (key) and t (value)?
#
gather(t_type, t, -id, -value) %>%
# Select the right columns in the right order.
# Watch out: We did not select t_type, so it gets dropped.
select(id, t, value) %>%
# Arrange / sort the data by the following columns.
# For a descending order put a "-" in front of the column name.
arrange(id, t)
Result
id t value
1 A 1 1
2 A 2 1
3 A 3 2
4 A 4 2
5 B 1 3
6 B 2 3
7 B 3 4
8 B 4 4
9 C 1 5
10 C 2 5
11 C 3 6
12 C 4 6
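As a side note, gather() has since been superseded by pivot_longer() (tidyr 1.0+); here is a sketch of the equivalent pipeline under that assumption:
d %>%
  pivot_longer(c(t1, t2), names_to = "t_type", values_to = "t") %>%
  select(id, t, value) %>%
  arrange(id, t)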
So the goal is to melt the t1 and t2 columns and drop the key column that appears as a result. There are a couple of options. Base R's reshape seems tedious. We may, however, use melt from reshape2:
library(reshape2)
melt(d, measure.vars = c("t1", "t2"), value.name = "t")[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4
where -3 drops the key column. We may also use gather, as in
gather(d, "key", "t", t1, t2)[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4
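Note that both outputs above come back in melt order rather than sorted by id and t; if you need the exact row order of result, sort afterwards (a sketch in base R):
res <- gather(d, "key", "t", t1, t2)[-3]
res[order(res$id, res$t), ]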
I have a tidy data.frame with two columns: exp and val. I want to find which values of val are shared among all different experiments.
df <- data.frame(exp = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'),
val = c(10, 20, 15, 10, 10, 15, 99, 2, 15, 20, 10, 4))
df
exp val
1 A 10
2 A 20
3 A 15
4 A 10
5 B 10
6 B 15
7 B 99
8 B 2
9 C 15
10 C 20
11 C 10
12 C 4
Expected result could be either a vector of values:
10, 15
or a column on the data frame telling if that value is shared:
exp val shared
<fct> <dbl> <lgl>
1 A 10 TRUE
2 A 20 FALSE
3 A 15 TRUE
4 A 10 TRUE
5 B 10 TRUE
6 B 15 TRUE
7 B 99 FALSE
8 B 2 FALSE
9 C 15 TRUE
10 C 20 FALSE
11 C 10 TRUE
12 C 4 FALSE
I was able to find an answer (see the self-answer below) but this seems like a common enough question that there must be a better way than the really hacky solution I came up with.
I tried to solve this problem in dplyr since that's what I'm familiar with, but I'm interested in any kind of solution.
Or you can group by val and then check whether the number of distinct exp for that val equals the number of distinct exp in the whole data frame:
df %>%
group_by(val) %>%
mutate(shared = n_distinct(exp) == n_distinct(.$exp))
# notice the first exp refers to exp for each group while .$exp refers
# to the overall exp column in the data frame
# A tibble: 12 x 3
# Groups: val [6]
# exp val shared
# <fct> <dbl> <lgl>
# 1 A 10 TRUE
# 2 A 20 FALSE
# 3 A 15 TRUE
# 4 A 10 TRUE
# 5 B 10 TRUE
# 6 B 15 TRUE
# 7 B 99 FALSE
# 8 B 2 FALSE
# 9 C 15 TRUE
#10 C 20 FALSE
#11 C 10 TRUE
#12 C 4 FALSE
In base R you can use table:
a <- table(df)
as.numeric(colnames(a)[colSums(a > 0) == nrow(a)])
[1] 10 15
you can also do:
df %>%
mutate(s = val %in% as.numeric(colnames(a)[colSums(a > 0) == nrow(a)]))
exp val s
1 A 10 TRUE
2 A 20 FALSE
3 A 15 TRUE
4 A 10 TRUE
5 B 10 TRUE
6 B 15 TRUE
7 B 99 FALSE
8 B 2 FALSE
9 C 15 TRUE
10 C 20 FALSE
11 C 10 TRUE
12 C 4 FALSE
Here is an other base R solution:
x <- split(df$val, df$exp)
Reduce(intersect, x)
## [1] 10 15
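If you also want the logical column the OP asked about, a %in% lookup against the shared values is enough (a sketch building on the above):
shared_vals <- Reduce(intersect, split(df$val, df$exp))
df$shared <- df$val %in% shared_vals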
We can go through the data.frame row by row and count how many groups contain that row's value of val.
To deal with values repeated within a group, we first use group_by %>% distinct to drop repeats of val within each group. But then, to get just the values of val as a vector, we need ungroup %>% select(val) %>% unlist, which seems needlessly complicated.
Finally, we check whether the number of groups the value is found in equals the total number of groups.
df %>%
rowwise() %>%
mutate(num_groups = sum(group_by(., exp) %>%
distinct(val) %>%
ungroup() %>%
select(val) %>%
unlist() %in% val),
shared = num_groups == length(unique(.$exp)))
# A tibble: 12 x 4
exp val num_groups shared
<fct> <dbl> <int> <lgl>
1 A 10 3 TRUE
2 A 20 2 FALSE
3 A 15 3 TRUE
4 A 10 3 TRUE
5 B 10 3 TRUE
6 B 15 3 TRUE
7 B 99 1 FALSE
8 B 2 1 FALSE
9 C 15 3 TRUE
10 C 20 2 FALSE
11 C 10 3 TRUE
12 C 4 1 FALSE
I have a very basic question, but I'm new to R so would appreciate any help.
I have a column (among other columns) in one dataset whose rows contain numeric codes (for example).
In another dataset, I have two columns, one is the numeric codes (same as above) and the column next to it are names.
Is there a way in R to replace the numeric codes in the first dataset with the names, essentially using the second dataset as a lookup reference?
Many thanks for your help
Some sample data:
set.seed(42) # because I use `sample`, o/w unnecessary
df1 <- data.frame(n = sample(5, size = 10, replace = TRUE))
str(df1)
# 'data.frame': 10 obs. of 1 variable:
# $ n: int 5 5 2 5 4 3 4 1 4 4
df2 <- data.frame(n = 1:5, txt = LETTERS[5:9], stringsAsFactors = FALSE)
str(df2)
# 'data.frame': 5 obs. of 2 variables:
# $ n : int 1 2 3 4 5
# $ txt: chr "E" "F" "G" "H" ...
Base R use of merge:
merge(df1, df2, by = "n")
# n txt
# 1 1 E
# 2 2 F
# 3 3 G
# 4 4 H
# 5 4 H
# 6 4 H
# 7 4 H
# 8 5 I
# 9 5 I
# 10 5 I
Notice that the order of df1 is not preserved. We can use merge(..., sort = FALSE), but the order is "unspecified" (?merge).
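If preserving the original row order matters in base R, here is a sketch using match() as a lookup instead of merging:
df1$txt <- df2$txt[match(df1$n, df2$n)]
# df1 keeps its original row order, with the matching txt attached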
Using dplyr::left_join:
library(dplyr)
df1 %>%
left_join(df2, by = "n")
# n txt
# 1 5 I
# 2 5 I
# 3 2 F
# 4 5 I
# 5 4 H
# 6 3 G
# 7 4 H
# 8 1 E
# 9 4 H
# 10 4 H
(Order is preserved.)
I have a dataset of true values (location) that I'm attempting to compare to a vector of estimated values using dplyr. My code below produces a recycling warning and the wrong result. How do I compare each value of data$location to every value of est.locations and collapse the resulting vector to TRUE if all comparisons are greater than 20?
library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>% sort)
est.locations <- runif(12, 0, 1500) %>% sort
data %>%
mutate(false.neg = (all(abs(location - est.locations) > 20)))
num location false.neg
1 1 453.4281 FALSE
2 2 454.4260 FALSE
3 3 718.0420 FALSE
4 4 801.2217 FALSE
5 5 802.7981 FALSE
6 6 854.2148 FALSE
7 7 873.6085 FALSE
8 8 901.0217 FALSE
9 9 1032.8321 FALSE
10 10 1240.3547 FALSE
Warning message:
In c(...) :
longer object length is not a multiple of shorter object length
The context of the question is dplyr, but I'm open to other suggestions that may be faster. This is a piece of a larger calculation I'm doing on birth-death mcmc chains for 3000 iterations * 200 datasets. (i.e. repeated many times and the number of locations will be different among datasets and for each iteration.)
UPDATE (10/13/15):
I'm going to mark akrun's solution as the answer. A linear algebra approach is a natural fit for this problem and with a little tweaking this will work for calculating both FNR and FPR (FNR should need an (l)apply by iteration, FPR should be one large vector/matrix operation).
JohannesNE's solution points out the issue with my initial approach -- all() reduces the whole comparison to a single value, which mutate then recycles across rows, when instead I intended to do this operation row-wise. Which also leads me to think there is likely a dplyr solution using rowwise() and do().
I attempted to limit the scope of the question in my initial post. But for added context, the full problem is on a Bayesian mixture model with an unknown number of components, where the components are defined by a 1D point process. Estimation results in a 'random effects' chain similar in structure to the version of est.locations below. The length mismatch is a result of having to estimate the number of components.
## Clarification of problem
options("max.print" = 100)
set.seed(1)
# True values (number of items and their location)
true.locations <-
data.frame("num" = 1:10,
"location" = runif(10, 0, 1500) %>% sort)
# Mcmc chain of item-specific values ('random effects')
iteration <<- 0
est.locations <-
lapply(sample(10:14, 3000, replace=T), function(x) {
iteration <<- iteration + 1
total.items <- rep(x, x)
num <- 1:x
location <- runif(x, 0, 1500) %>% sort
data.frame(iteration, total.items, num, location)
}) %>% do.call(rbind, .)
print(est.locations)
iteration total.items num location
1 1 11 1 53.92243818
2 1 11 2 122.43662006
3 1 11 3 203.87297671
4 1 11 4 641.70211495
5 1 11 5 688.19477968
6 1 11 6 1055.40283048
7 1 11 7 1096.11595818
8 1 11 8 1210.26744065
9 1 11 9 1220.61185888
10 1 11 10 1362.16553219
11 1 11 11 1399.02227302
12 2 10 1 160.55916378
13 2 10 2 169.66834129
14 2 10 3 212.44257723
15 2 10 4 228.42561489
16 2 10 5 429.22830291
17 2 10 6 540.42659572
18 2 10 7 594.58339156
19 2 10 8 610.53964624
20 2 10 9 741.62600969
21 2 10 10 871.51458277
22 3 13 1 10.88957267
23 3 13 2 42.66629869
24 3 13 3 421.77297967
25 3 13 4 429.95036650
[ reached getOption("max.print") -- omitted 35847 rows ]
You can use sapply (here inside mutate, though this doesn't really take advantage of dplyr's features).
library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>% sort)
est.locations <- runif(12, 0, 1500) %>% sort
data %>%
mutate(false.neg = sapply(location, function(x) {
all(abs(x - est.locations) > 20)
}))
num location false.neg
1 1 92.67941 TRUE
2 2 302.52290 FALSE
3 3 398.26299 TRUE
4 4 558.18585 FALSE
5 5 859.28005 TRUE
6 6 943.67107 TRUE
7 7 991.19669 TRUE
8 8 1347.58453 TRUE
9 9 1362.31168 TRUE
10 10 1417.01290 FALSE
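The rowwise() approach the OP suspected should exist also works; in current dplyr no do() is needed (a sketch, assuming est.locations is the simple vector from the question):
data %>%
  rowwise() %>%
  mutate(false.neg = all(abs(location - est.locations) > 20)) %>%
  ungroup()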
We can use outer for this kind of comparison. It computes every pairwise difference between 'location' and 'est.locations'; we take the abs, compare with 20, negate (!), take rowSums, and negate again, so that a row is TRUE only when all of its elements are greater than 20.
data$false.neg <- !rowSums(!(abs(outer(data$location, est.locations, FUN = '-')) > 20))
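Here is the same computation unrolled for readability; this sketch is equivalent to the one-liner above:
diffs <- abs(outer(data$location, est.locations, FUN = "-"))
near <- diffs <= 20                   # TRUE where an estimate is within 20
data$false.neg <- rowSums(near) == 0  # no estimate nearby => false negative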