I have a set of data which shows the visit ID and the subject name
visit<-c(1,2,3,1,2,1,1,2,3,1,2,3)
subject<-c("A","A","A","B","B","C","D","D","D","E","E","E")
data<-data.frame(visit=visit,subject=subject)
I attempted to work out the latest visit ID for each subject:
tapply(visit,subject,max)
And I get this output:
A B C D E
3 2 1 3 3
I am wondering if there is any way that I can change the output such that it becomes:
A 3
B 2
C 1
D 3
E 3
Thank you
You can try aggregate
aggregate(visit~subject, data, max)
# subject visit
#1 A 3
#2 B 2
#3 C 1
#4 D 3
#5 E 3
Or from tapply
res <- tapply(visit,subject,max)
data.frame(subject=names(res), visit=res)
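If the row names that tapply carries over bother you, row.names = NULL drops them (a small optional cleanup, not part of the original answer):

```r
# the question's data
visit <- c(1,2,3,1,2,1,1,2,3,1,2,3)
subject <- c("A","A","A","B","B","C","D","D","D","E","E","E")

res <- tapply(visit, subject, max)
# row.names = NULL replaces the duplicated subject row names
# with plain sequential ones
out <- data.frame(subject = names(res), visit = res, row.names = NULL)
out
```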
Or data.table
library(data.table)
setDT(data)[, list(visit=max(visit)), by=subject]
And a dplyr solution would be:
library(dplyr)
data %>% group_by(subject) %>% summarize(max = max(visit))
## Source: local data frame [5 x 2]
## subject max
## 1 A 3
## 2 B 2
## 3 C 1
## 4 D 3
## 5 E 3
It may feel dirty, but using the base function as.matrix (or matrix for that matter) will give you what you need.
> as.matrix(tapply(visit,subject,max))
[,1]
A 3
B 2
C 1
D 3
E 3
You can easily do this in base R with stack:
stack(tapply(visit, subject, max))
# values ind
# 1 3 A
# 2 2 B
# 3 1 C
# 4 3 D
# 5 3 E
(Note: In this case, the values for "visit" and "subject" aren't actually coming from your data.frame. Just thought you should know!)
(Second note: You could also do data.frame(as.table(tapply(visit, subject, max))) but that is more deceptive than using stack so may lead to less readable code later on.)
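To address the first note above, wrapping the call in with() makes the values come from the data.frame itself rather than the free-standing vectors (a minor variation on the same idea):

```r
visit <- c(1,2,3,1,2,1,1,2,3,1,2,3)
subject <- c("A","A","A","B","B","C","D","D","D","E","E","E")
data <- data.frame(visit = visit, subject = subject)

# with() evaluates the expression in the data.frame's scope,
# so the columns of `data` are used, not the global vectors
res <- with(data, stack(tapply(visit, subject, max)))
res
```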
expand.grid orders its results by the last variable entered, but I need them ordered by the first.
Given the following code:
expand.grid(a=(1:2),b=c("a","b","c"))
a b
1 1 a
2 2 a
3 1 b
4 2 b
5 1 c
6 2 c
Notice how column a changes most often with b less often.
The algorithm, it seems, locks the 2nd (or Nth) variable b and alternates the 1st (or N-1th) variable until the grid reaches every possible combination.
I need expand.grid, or a similar function, to fix the 1st variable first, then advance the 2nd, and so on through all N.
The desired result for the example is:
a b
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
One way that works for the example is simply to order by column a, but that does not generalize: I would need to order by N columns in sequence, and I have not found a way to do so.
It seems so trivial, but I cannot find a way to get expand.grid to behave like I need.
Any solution must work on any arbitrary number of entries to expand.grid and of any arbitrary size. Thank you.
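For the "order by N columns in sequence" part, base R's do.call can pass every column of a data frame to order() at once, whatever the number of columns (a generic sketch using the example data, independent of the answers below):

```r
g <- expand.grid(a = 1:2, b = c("a", "b", "c"))

# order() accepts any number of sort keys; do.call feeds it all
# columns of g left to right, making column a the primary key
g_sorted <- g[do.call(order, g), ]
rownames(g_sorted) <- NULL
g_sorted
#   a b
# 1 1 a
# 2 1 b
# 3 1 c
# 4 2 a
# 5 2 b
# 6 2 c
```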
You can try arrange_all:
library(tidyverse)
df <- expand.grid(a=(1:2),b=c("a","b","c"))
df %>%
  arrange_all()
We can use crossing from tidyr
library(tidyr)
crossing(a = 1:2, b = c('a', 'b', 'c'))
# A tibble: 6 x 2
# a b
# <int> <chr>
#1 1 a
#2 1 b
#3 1 c
#4 2 a
#5 2 b
#6 2 c
Here is a base-R solution that works with any number of variables, without knowing their content beforehand.
Gather all the variables in a list, in the order in which you want to expand them. Apply rev first to the list inside expand.grid, and a second time to the output, to get the desired expansion order.
Your example:
l <- list(a=(1:2),b=c("a","b","c"))
rev(expand.grid(rev(l)))
#> a b
#> 1 1 a
#> 2 1 b
#> 3 1 c
#> 4 2 a
#> 5 2 b
#> 6 2 c
An example with 3 variables:
var1 <- c("SR", "PL")
var2 <- c(1,2,3)
var3 <- c("A",'B')
l <- list(var1,var2,var3)
rev(expand.grid(rev(l)))
#> Var3 Var2 Var1
#> 1 SR 1 A
#> 2 SR 1 B
#> 3 SR 2 A
#> 4 SR 2 B
#> 5 SR 3 A
#> 6 SR 3 B
#> 7 PL 1 A
#> 8 PL 1 B
#> 9 PL 2 A
#> 10 PL 2 B
#> 11 PL 3 A
#> 12 PL 3 B
Try this:
expand.grid(b=c("a","b","c"), a=(1:2))[, c("a", "b")]
#> a b
#> 1 1 a
#> 2 1 b
#> 3 1 c
#> 4 2 a
#> 5 2 b
#> 6 2 c
Created on 2020-03-19 by the reprex package (v0.3.0)
I have the following data frame
v1 v2 v3
a 2 5
b 5 3
c 2 1
d 2 1
e 1 2
a 2 4
a 8 1
e 1 6
b 0 1
c 2 8
d 1 5
Using R, I want to compute, for every unique value of V1, the difference between the max V3 and the min V3.
Expected :
Val max_min
a “5-1”
b “3-1”
c “8-1”
d “5-1”
e “6-2”
I am trying
ddply(fil1, c("V1"), summarise, max(V3) - min(V1))
but I don't get the expected result. It gives the same value in max_min: the max(V3) - min(V3) for the whole data frame, not for each group.
I have also tried ave, with no success.
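If the quoted "5-1" strings in the expected output are literally what you want, pasting the per-group max and min together inside aggregate reproduces them exactly (a sketch; most approaches compute the numeric difference instead):

```r
# the question's data
df <- data.frame(
  v1 = c("a","b","c","d","e","a","a","e","b","c","d"),
  v2 = c(2, 5, 2, 2, 1, 2, 8, 1, 0, 2, 1),
  v3 = c(5, 3, 1, 1, 2, 4, 1, 6, 1, 8, 5)
)

# paste the per-group max and min into a "max-min" label;
# swap in max(i) - min(i) if you want the numeric difference
res <- aggregate(v3 ~ v1, df, FUN = function(i) paste0(max(i), "-", min(i)))
names(res) <- c("Val", "max_min")
res
#   Val max_min
# 1   a     5-1
# 2   b     3-1
# 3   c     8-1
# 4   d     5-1
# 5   e     6-2
```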
Or in base R,
MAX = aggregate(df$v3, list(df$v1), max)
MIN = aggregate(df$v3, list(df$v1), min)
MAX[,2] - MIN[,2]
[1] 4 2 7 4 4
A one liner of the above would be,
aggregate(v3 ~ v1, df, FUN = function(i)max(i) - min(i))
# v1 v3
#1 a 4
#2 b 2
#3 c 7
#4 d 4
#5 e 4
We can also use tapply which will display the output as follows,
with(df, tapply(v3, list(v1), function(i) max(i)-min(i)))
#a b c d e
#4 2 7 4 4
You could also go for split:
lapply(split(df$v3, df$v1), function(a) max(a)-min(a))
# $a
# [1] 4
# $b
# [1] 2
# $c
# [1] 7
# $d
# [1] 4
# $e
# [1] 4
If you want the output in exactly the format you defined:
ls <- lapply(split(df$v3, df$v1), function(a) max(a)-min(a))
data.frame(Val=names(ls), max_min=unlist(ls))
# Val max_min
#a a 4
#b b 2
#c c 7
#d d 4
#e e 4
If you're using dplyr you can use the summarise function. In base R, range returns a vector containing the min and max values, and diff finds the difference. So a one-liner is:
library(dplyr)
df %>% group_by(v1) %>% summarise(max_min = diff(range(v3)))
Given the following df:
a=c('a','b','c')
b=c(1,2,5)
c=c(2,3,4)
d=c(2,1,6)
df=data.frame(a,b,c,d)
a b c d
1 a 1 2 2
2 b 2 3 1
3 c 5 4 6
I'd like to apply a function that normally takes a vector (and returns a vector), like cummax, row by row to the columns in positions b to d.
Then, I'd like to have the output back in the df, either as a vector in a new column of the df, or replacing the original data.
I'd like to avoid writing it as a for loop that would iterate every row, pull out the content of the cells into a vector, do its thing and put it back.
Is there a more efficient way? I've tried the apply family of functions, but I'm struggling to find a good way to vectorise the column contents by row and get the right output.
The final output could look something like this (imagining I've applied the cummax() function):
a b c d
1 a 1 2 2
2 b 2 3 3
3 c 5 5 6
or
a b c d output
1 a 1 2 2 (1,2,2)
2 b 2 3 1 (2,3,3)
3 c 5 4 6 (5,5,6)
where output is a vector.
Seems this would just be a simple apply problem that you want to cbind to df:
> cbind(df, apply(df[ , 4:2] # work with columns in reverse order
, 1, # do it row-by-row
cummax) )
a b c d 1 2 3
d a 1 2 2 2 1 6
c b 2 3 1 2 3 6
b c 5 4 6 2 3 6
Ouch. Bitten by failing to notice that apply returns the result in a column-oriented matrix, which needs to be transposed; such a newbie mistake. But it does show the value of having a question with a reproducible dataset, I suppose.
> cbind(df, t(apply(df[ , 4:2] , 1, cummax) ) )
a b c d d c b
1 a 1 2 2 2 2 2
2 b 2 3 1 1 3 3
3 c 5 4 6 6 6 6
To destructively assign the result to df you would just use:
df <- # .... that code.
This does the concatenation with commas (and, as a result, no longer needs to be transposed):
> cbind(df, output=apply(df[ , 4:2] , 1, function(x) paste( cummax(x), collapse=",") ) )
a b c d output
1 a 1 2 2 2,2,2
2 b 2 3 1 1,3,3
3 c 5 4 6 6,6,6
I have the following table
id mycol counter
1 a 1
1 b 2
2 c 1
2 c 2
2 e 3
And this is what I need:
ID 1 2 3
1 a b done
2 c c done
I try to use the dcast function
mydata<-dcast(mydata, id~mycol, counter, value = 'mycol')
but it's not working. Any ideas?
It appears from your question that you're trying to do something like a reshaping from long to wide format. Here's how you can use base R reshape() to do this:
mydata <- data.frame(id=c(1L,1L,2L,2L,2L),mycol=c('a','b','c','c','e'),counter=c(1L,2L,1L,2L,3L),stringsAsFactors=F);
reshape(mydata,dir='w',idvar='id',timevar='counter');
## id mycol.1 mycol.2 mycol.3
## 1 1 a b <NA>
## 3 2 c c e
reshape() does not support such precise control over the resulting column names. You can fix them up yourself afterward. Assuming you saved the above result as res, you can do this:
colnames(res) <- sub(perl=T,'^mycol\\.','',colnames(res));
res;
## id 1 2 3
## 1 1 a b <NA>
## 3 2 c c e
I just have a simple question. I really appreciate everyone's input; you have been a great help to my project. I have an additional question about data frames in R.
I have data frame that looks similar to something like this:
C <- c("","","","","","","","A","B","D","A","B","D","A","B","D")
D <- c(NA,NA,NA,2,NA,NA,1,1,4,2,2,5,2,1,4,2)
G <- list(C=C,D=D)
T <- as.data.frame(G)
T
C D
1 NA
2 NA
3 NA
4 2
5 NA
6 NA
7 1
8 A 1
9 B 4
10 D 2
11 A 2
12 B 5
13 D 2
14 A 1
15 B 4
16 D 2
I would like to be able to condense all the repeat characters into one, and look similar to this:
J B C E
1 2 1
2 A 1 2 1
3 B 4 5 4
4 D 2 2 2
So of course the data is all the same; it is just condensed, with new columns formed to hold it. I am sure there is an easy way to do this, but in the books I have looked through I haven't seen anything for it!
EDIT: I edited the example because it wasn't working with the answers so far. I wonder if the NAs, blanks, and the unevenness they introduce are contributing?
Here's a reshape solution:
require(reshape)
cast(T, C ~ ., function(x) x)
Changed T to my.df to avoid a bad habit (T is also shorthand for TRUE). This returns a list, which may not be what you want, but you can convert from there.
C <- c("A","B","D","A","B","D","A","B","D")
D <- c(1,4,2,2,5,2,1,4,2)
my.df <- data.frame(id=C,val=D)
ret <- function(x) x
by.df <- by(my.df$val,INDICES=my.df$id,ret)
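To convert the list returned by by into the row-per-group shape the question asks for, do.call with rbind works, assuming every group has the same number of values (a sketch restating the code above):

```r
C <- c("A","B","D","A","B","D","A","B","D")
D <- c(1,4,2,2,5,2,1,4,2)
my.df <- data.frame(id = C, val = D)

by.df <- by(my.df$val, INDICES = my.df$id, function(x) x)
# each list element holds one group's values; because every group
# here has the same length, rbind stacks them into a matrix
res <- do.call(rbind, by.df)
res
#   [,1] [,2] [,3]
# A    1    2    1
# B    4    5    4
# D    2    2    2
```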
This seems to get the results you are looking for. I'm assuming it's OK to remove the NA values since that matches the desired output you show.
T <- na.omit(T)
T$ind <- ave(1:nrow(T), T$C, FUN = seq_along)
reshape(T, direction = "wide", idvar = "C", timevar = "ind")
# C D.1 D.2 D.3
# 4 2 1 NA
# 8 A 1 2 1
# 9 B 4 5 4
# 10 D 2 2 2
library(reshape2)
dcast(T, C ~ ind, value.var = "D", fill = "")
# C 1 2 3
# 1 2 1
# 2 A 1 2 1
# 3 B 4 5 4
# 4 D 2 2 2