Formatting the output in R - r

I have a set of data which shows the visit ID and the subject name
visit<-c(1,2,3,1,2,1,1,2,3,1,2,3)
subject<-c("A","A","A","B","B","C","D","D","D","E","E","E")
data<-data.frame(visit=visit,subject=subject)
I attempted to work out the latest visit ID for each subject:
tapply(visit,subject,max)
And I get this output:
A B C D E
3 2 1 3 3
I am wondering if there is any way that I can change the output such that it becomes:
A 3
B 2
C 1
D 3
E 3
Thank you

You can try aggregate
aggregate(visit~subject, data, max)
# subject visit
#1 A 3
#2 B 2
#3 C 1
#4 D 3
#5 E 3
Or from tapply
res <- tapply(visit,subject,max)
data.frame(subject=names(res), visit=res)
Or data.table
library(data.table)
setDT(data)[, list(visit=max(visit)), by=subject]

And a dplyr solution would be:
library(dplyr)
data %>% group_by(subject) %>% summarize(max = max(visit))
## Source: local data frame [5 x 2]
## subject max
## 1 A 3
## 2 B 2
## 3 C 1
## 4 D 3
## 5 E 3

It may feel dirty, but using the base function as.matrix (or matrix for that matter) will give you what you need.
> as.matrix(tapply(visit,subject,max))
[,1]
A 3
B 2
C 1
D 3
E 3

You can easily do this in base R with stack:
stack(tapply(visit, subject, max))
# values ind
# 1 3 A
# 2 2 B
# 3 1 C
# 4 3 D
# 5 3 E
(Note: In this case, the values for "visit" and "subject" aren't actually coming from your data.frame. Just thought you should know!)
(Second note: You could also do data.frame(as.table(tapply(visit, subject, max))) but that is more deceptive than using stack so may lead to less readable code later on.)
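Following up on the first note above, here is a minimal sketch that evaluates tapply inside the data frame via with() and then reorders and renames the columns that stack() produces. The names subject and visit are chosen here to match the question; stack() itself always returns values and ind.

```r
# Rebuild the question's data frame
data <- data.frame(
  visit   = c(1, 2, 3, 1, 2, 1, 1, 2, 3, 1, 2, 3),
  subject = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "E", "E", "E")
)

# with() makes tapply read visit and subject from the data frame,
# not from same-named vectors in the workspace
res <- with(data, tapply(visit, subject, max))

# stack() names its columns "values" and "ind"; reorder and rename them
out <- stack(res)[, c("ind", "values")]
names(out) <- c("subject", "visit")
print(out)
```

The subject column comes back as a factor; wrap it in as.character() if plain strings are needed.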

Related

R how to get a result like expand.grid, but control the order of the expansion?

The expand.grid gives the results ordered by the last entered set, but I need it based on the first set.
Given the following code:
expand.grid(a=(1:2),b=c("a","b","c"))
a b
1 1 a
2 2 a
3 1 b
4 2 b
5 1 c
6 2 c
Notice how column a changes most often with b less often.
The algorithm, it seems, is to lock the 2nd (or Nth) variable b and then cycle through the 1st (or (N-1)th) variable until the grid covers every possible combination.
I need expand.grid, or a similar function, to first fix the 1st variable and then vary the 2nd variable, and so on, until it gets through all N.
The desired result for the example is:
a b
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
One way that works for the example is simply to order by column a, but that does not work in general, as I would need to order by N columns in sequence and I have not found a way to do so.
It seems so trivial, but I cannot find a way to get expand.grid to behave like I need.
Any solution must work on any arbitrary number of entries to expand.grid and of any arbitrary size. Thank you.
Try this:
library(tidyverse)
df <- expand.grid(a=(1:2),b=c("a","b","c"))
df %>%
arrange_all()
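For a base R equivalent that sorts by every column from left to right, whatever the number of columns, one option is to pass all columns to order() through do.call. This is a sketch and not part of the answer above; unname() is an added precaution so that a column name can never be mistaken for one of order()'s own arguments, such as decreasing.

```r
df <- expand.grid(a = 1:2, b = c("a", "b", "c"))

# order() accepts any number of sort keys; do.call supplies one per column
sorted <- df[do.call(order, unname(as.list(df))), ]
rownames(sorted) <- NULL  # renumber the rows after sorting
print(sorted)
```

Because expand.grid turns character inputs into factors by default, column b is sorted by its factor levels here, which follow the order in which the values were supplied.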
We can use crossing from tidyr
library(tidyr)
crossing(a = 1:2, b = c('a', 'b', 'c'))
# A tibble: 6 x 2
# a b
# <int> <chr>
#1 1 a
#2 1 b
#3 1 c
#4 2 a
#5 2 b
#6 2 c
Here is a base R solution that works with any number of variables, without knowing their content beforehand.
Put all the variables in a list, in the order in which you want them to expand. Apply rev to the list before passing it to expand.grid, then apply rev again to the output to restore the original column order.
Your example:
l <- list(a=(1:2),b=c("a","b","c"))
rev(expand.grid(rev(l)))
#> a b
#> 1 1 a
#> 2 1 b
#> 3 1 c
#> 4 2 a
#> 5 2 b
#> 6 2 c
An example with 3 variables:
var1 <- c("SR", "PL")
var2 <- c(1,2,3)
var3 <- c("A",'B')
l <- list(var1,var2,var3)
rev(expand.grid(rev(l)))
#> Var3 Var2 Var1
#> 1 SR 1 A
#> 2 SR 1 B
#> 3 SR 2 A
#> 4 SR 2 B
#> 5 SR 3 A
#> 6 SR 3 B
#> 7 PL 1 A
#> 8 PL 1 B
#> 9 PL 2 A
#> 10 PL 2 B
#> 11 PL 3 A
#> 12 PL 3 B
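The double rev trick above can be wrapped in a small helper that takes its inputs the same way expand.grid does. This is only a sketch: the name expand_grid_first is hypothetical, and stringsAsFactors = FALSE is a choice made here, not something the answer specifies.

```r
# Hypothetical helper: like expand.grid, but the first argument varies slowest
expand_grid_first <- function(...) {
  args <- list(...)
  # Reverse the inputs, expand, then reverse the columns back
  rev(expand.grid(rev(args), stringsAsFactors = FALSE))
}

out <- expand_grid_first(a = 1:2, b = c("a", "b", "c"))
print(out)
```

Passing named arguments keeps the column names, which avoids the Var3, Var2, Var1 labels seen in the unnamed three-variable example.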
Try this:
expand.grid(b=c("a","b","c"), a=(1:2))[, c("a", "b")]
#> a b
#> 1 1 a
#> 2 1 b
#> 3 1 c
#> 4 2 a
#> 5 2 b
#> 6 2 c
Created on 2020-03-19 by the reprex package (v0.3.0)

Difference of maximum and minimum by group

I have the following data frame
v1 v2 v3
a 2 5
b 5 3
c 2 1
d 2 1
e 1 2
a 2 4
a 8 1
e 1 6
b 0 1
c 2 8
d 1 5
Using R, I want to compute, for every unique value of V1, the difference between the max and the min of V3.
Expected :
Val max_min
a “5-1”
b “3-1”
c “8-1”
d “5-1”
e “6-2”
I tried:
ddply(fil1, c("V1"), summarise, max(V3) - min(V1))
but I don't get the expected result. It gives the same value in max_min for every row: the max(V3) - min(V3) of the whole data frame rather than of each group.
I have also tried average, with no success.
Or in base R,
MAX = aggregate(df$v3, list(df$v1), max)
MIN = aggregate(df$v3, list(df$v1), min)
MAX[,2] - MIN[,2]
[1] 4 2 7 4 4
A one liner of the above would be,
aggregate(v3 ~ v1, df, FUN = function(i)max(i) - min(i))
# v1 v3
#1 a 4
#2 b 2
#3 c 7
#4 d 4
#5 e 4
We can also use tapply which will display the output as follows,
with(df, tapply(v3, list(v1), function(i) max(i)-min(i)))
#a b c d e
#4 2 7 4 4
You could also go for split:
lapply(split(df$v3, df$v1), function(a) max(a)-min(a))
# $a
# [1] 4
# $b
# [1] 2
# $c
# [1] 7
# $d
# [1] 4
# $e
# [1] 4
If you insist on the output format you defined:
ls <- lapply(split(df$v3, df$v1), function(a) max(a)-min(a))
data.frame(Val=names(ls), max_min=unlist(ls))
# Val max_min
#a a 4
#b b 2
#c c 7
#d d 4
#e e 4
If you're using dplyr you can use the summarise function. In base R, range returns a vector containing the min and max values, and diff finds the difference. So a one-liner is:
df %>% group_by(V1) %>% summarise(max_min=diff(range(V3)))
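For a self-contained check of the diff(range()) idea without extra packages, here is a sketch that rebuilds the question's data and also produces the "max-min" text shown in the expected output. It assumes lowercase column names (v1, v3), as in the printed data frame; the label column name is introduced here for illustration.

```r
df <- data.frame(
  v1 = c("a", "b", "c", "d", "e", "a", "a", "e", "b", "c", "d"),
  v2 = c(2, 5, 2, 2, 1, 2, 8, 1, 0, 2, 1),
  v3 = c(5, 3, 1, 1, 2, 4, 1, 6, 1, 8, 5)
)

# Numeric difference per group: range() gives c(min, max), diff() subtracts
out <- aggregate(v3 ~ v1, df, function(x) diff(range(x)))
names(out) <- c("Val", "max_min")

# Text form of the same pair, e.g. "5-1" for group a
out$label <- aggregate(v3 ~ v1, df,
                       function(x) paste(max(x), min(x), sep = "-"))$v3
print(out)
```

Both aggregate calls return their groups in the same sorted order, so the two results line up row by row.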

vectorise rows of a dataframe, apply vector function, return to original dataframe r

Given the following df:
a=c('a','b','c')
b=c(1,2,5)
c=c(2,3,4)
d=c(2,1,6)
df=data.frame(a,b,c,d)
a b c d
1 a 1 2 2
2 b 2 3 1
3 c 5 4 6
I'd like to apply a function that normally takes a vector (and returns a vector) like cummax row by row to the columns in position b to d.
Then, I'd like to have the output back in the df, either as a vector in a new column of the df, or replacing the original data.
I'd like to avoid writing it as a for loop that would iterate every row, pull out the content of the cells into a vector, do its thing and put it back.
Is there a more efficient way? I've given the apply family functions a go, but I'm struggling to first get a good way to vectorise content of columns by row and get the right output.
the final output could look something like that (imagining I've applied a cummax() function).
a b c d
1 a 1 2 2
2 b 2 3 3
3 c 5 5 6
or
a b c d output
1 a 1 2 2 (1,2,2)
2 b 2 3 1 (2,3,3)
3 c 5 4 6 (5,5,6)
where output is a vector.
Seems this would just be a simple apply problem that you want to cbind to df:
> cbind(df, apply(df[ , 4:2] # work with columns in reverse order
, 1, # do it row-by-row
cummax) )
a b c d 1 2 3
d a 1 2 2 2 1 6
c b 2 3 1 2 3 6
b c 5 4 6 2 3 6
Ouch. I failed to notice that apply returns its result as a column-oriented matrix, so the result needs to be transposed. Such a newbie mistake. But it does show the value of having a question with a reproducible dataset, I suppose.
> cbind(df, t(apply(df[ , 4:2] , 1, cummax) ) )
a b c d d c b
1 a 1 2 2 2 2 2
2 b 2 3 1 1 3 3
3 c 5 4 6 6 6 6
To destructively assign the result to df you would just use:
df <- # .... that code.
This does the concatenation with commas (and as a result no longer needs to be transposed):
> cbind(df, output=apply(df[ , 4:2] , 1, function(x) paste( cummax(x), collapse=",") ) )
a b c d output
1 a 1 2 2 2,2,2
2 b 2 3 1 1,3,3
3 c 5 4 6 6,6,6
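The calls above pass the columns in reverse order (4:2), so their results differ from the output shown in the question. A sketch of the same apply approach with the columns in their original order, writing the result back over columns b to d:

```r
df <- data.frame(a = c("a", "b", "c"),
                 b = c(1, 2, 5),
                 c = c(2, 3, 4),
                 d = c(2, 1, 6))

cols <- c("b", "c", "d")

# apply() returns one column per input row, so transpose before assigning
df[cols] <- t(apply(df[cols], 1, cummax))
print(df)
```

Selecting the columns by name rather than by position keeps the code working if the column order of df changes.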

How to transpose when the value is text and the new column is a number

I have the following table
id mycol counter
1 a 1
1 b 2
2 c 1
2 c 2
2 e 3
And this is what I need
ID 1 2 3
1 a b done
2 c c done
I tried to use the dcast function
mydata<-dcast(mydata, id~mycol, counter, value = 'mycol')
but it's not working. Any idea?
It appears from your question that you're trying to do something like a reshaping from long to wide format. Here's how you can use base R reshape() to do this:
mydata <- data.frame(id=c(1L,1L,2L,2L,2L),mycol=c('a','b','c','c','e'),counter=c(1L,2L,1L,2L,3L),stringsAsFactors=F);
reshape(mydata,dir='w',idvar='id',timevar='counter');
## id mycol.1 mycol.2 mycol.3
## 1 1 a b <NA>
## 3 2 c c e
reshape() does not support such precise control over the resulting column names. You can fix them up yourself afterward. Assuming you saved the above result as res, you can do this:
colnames(res) <- sub(perl=T,'^mycol\\.','',colnames(res));
res;
## id 1 2 3
## 1 1 a b <NA>
## 3 2 c c e
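Putting the two steps together, a sketch that stores the reshaped result as res before renaming. The argument names are spelled out in full here (direction = "wide") rather than relying on partial matching, and resetting the row names is an extra step that the answer does not include.

```r
mydata <- data.frame(id      = c(1L, 1L, 2L, 2L, 2L),
                     mycol   = c("a", "b", "c", "c", "e"),
                     counter = c(1L, 2L, 1L, 2L, 3L),
                     stringsAsFactors = FALSE)

# Long to wide: one row per id, one mycol.<counter> column per counter value
res <- reshape(mydata, direction = "wide", idvar = "id", timevar = "counter")

# Strip the "mycol." prefix so the columns are named 1, 2, 3
names(res) <- sub("^mycol\\.", "", names(res))
rownames(res) <- NULL
print(res)
```

Column names such as 1 are not syntactic R names, so they have to be accessed with quotes or backticks, for example res[["3"]].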

Condensing Data Frame in R

I just have a simple question. I really appreciate everyone's input; you have been a great help to my project. I have an additional question about data frames in R.
I have a data frame that looks similar to this:
C <- c("","","","","","","","A","B","D","A","B","D","A","B","D")
D <- c(NA,NA,NA,2,NA,NA,1,1,4,2,2,5,2,1,4,2)
G <- list(C=C,D=D)
T <- as.data.frame(G)
T
C D
1 NA
2 NA
3 NA
4 2
5 NA
6 NA
7 1
8 A 1
9 B 4
10 D 2
11 A 2
12 B 5
13 D 2
14 A 1
15 B 4
16 D 2
I would like to be able to condense all the repeat characters into one, and look similar to this:
J B C E
1 2 1
2 A 1 2 1
3 B 4 5 4
4 D 2 2 2
So of course, the data is all the same, it is just that it is condensed and new columns are formed to hold the data. I am sure there is an easy way to do it, but from the books I have looked through, I haven't seen anything for this!
EDIT I edited the example because it wasn't working with the answers so far. I wonder if the NA's, blanks, and unevenness from the blanks are contributing??
Here's a reshape solution:
require(reshape)
cast(T, C ~ ., function(x) x)
I changed T to my.df to avoid a bad habit (T is also shorthand for TRUE). This returns a list, which may not be what you want, but you can convert from there.
C <- c("A","B","D","A","B","D","A","B","D")
D <- c(1,4,2,2,5,2,1,4,2)
my.df <- data.frame(id=C,val=D)
ret <- function(x) x
by.df <- by(my.df$val,INDICES=my.df$id,ret)
This seems to get the results you are looking for. I'm assuming it's OK to remove the NA values since that matches the desired output you show.
T <- na.omit(T)
T$ind <- ave(1:nrow(T), T$C, FUN = seq_along)
reshape(T, direction = "wide", idvar = "C", timevar = "ind")
# C D.1 D.2 D.3
# 4 2 1 NA
# 8 A 1 2 1
# 9 B 4 5 4
# 10 D 2 2 2
library(reshape2)
dcast(T, C ~ ind, value.var = "D", fill = "")
# C 1 2 3
# 1 2 1
# 2 A 1 2 1
# 3 B 4 5 4
# 4 D 2 2 2
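The base R steps above, restated as a self-contained sketch that starts from the vectors in the question and avoids the name T. The names dat and wide are chosen here for illustration.

```r
C <- c("", "", "", "", "", "", "", "A", "B", "D", "A", "B", "D", "A", "B", "D")
D <- c(NA, NA, NA, 2, NA, NA, 1, 1, 4, 2, 2, 5, 2, 1, 4, 2)

# Drop the rows where D is NA, as the answer above does
dat <- na.omit(data.frame(C = C, D = D, stringsAsFactors = FALSE))

# Number the repeats within each value of C: 1, 2, 3, ...
dat$ind <- ave(seq_len(nrow(dat)), dat$C, FUN = seq_along)

# One row per C value, one D.<n> column per repeat
wide <- reshape(dat, direction = "wide", idvar = "C", timevar = "ind")
print(wide)
```

Groups with fewer repeats than the others, such as the blank group here, are padded with NA in the trailing columns.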
