Assigning labels in R based on ID?

I have a data frame as follows:
DF<-data.frame(a=c(1,1,1,2,2,2,3,3,4,4),b=c(43,23,45,65,43,23,65,76,87,4))
a b
1 43
1 23
1 45
2 65
2 43
2 23
3 65
3 76
4 87
4 4
I want to set a flag like this:
a b flag
1 43 A
1 23 B
1 45 C
2 65 A
2 43 B
2 23 C
3 65 A
3 76 B
4 87 A
4 4 B
How can I get this done in R?

Using dplyr
library(dplyr)
DF %>% group_by(a) %>% mutate(flag=LETTERS[row_number()])
Using data.table (HT to @David Arenberg)
library(data.table)
setDT(DF)[, flag := LETTERS[1:.N], a]
And a soon-to-be-vintage solution (by @Roman Luštrik), which assumes DF is already sorted by a:
unlist(lapply(rle(DF$a)$lengths, function(x) LETTERS[seq_len(x)]))
Addendum
@akrun suggested the following extension of LETTERS to address the question that immediately arose, "What if there are more than 26 groups?" (by @James)
Let <- unlist(sapply(1:3, function(i) do.call(paste0,expand.grid(rep(list(LETTERS),i)))))
All of the code above remains fully functional when LETTERS is replaced by Let.

I'll throw in one more in base R:
transform(DF, flag = LETTERS[ave(a,a,FUN=seq_along)])
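For reference, the idea shared by these answers (number the rows within each group, then index into LETTERS) can be sketched in base R without any packages. The `pos` variable and the length check are additions for illustration and are not part of the answers above.

```r
DF <- data.frame(a = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
                 b = c(43, 23, 45, 65, 43, 23, 65, 76, 87, 4))

# Position of each row within its group of `a` (1, 2, 3, 1, 2, 3, ...)
pos <- ave(seq_along(DF$a), DF$a, FUN = seq_along)

# LETTERS has only 26 entries; indexing past that would silently give NA
stopifnot(max(pos) <= length(LETTERS))

DF$flag <- LETTERS[pos]
```

The check fails loudly if any group has more than 26 rows, which is the case the Let extension is meant to cover.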

Related

How to replace NAs with values from another column in data.table (Example given)?

DT is a data.table in which I want to replace the NAs in count with values from the visits column; Expected_DT is the desired result.
DT<-data.table(name=c("x","x","x","x"),hour=1:4,count=c(NA,45,56,78),visits=c(14,45,56,78))
name hour count visits
1: x 1 NA 14
2: x 2 45 45
3: x 3 56 56
4: x 4 78 78
This is what I want
Expected_DT<-data.table(name=c("x","x","x","x"),hour=1:4,count=c(14,45,56,78),visits=c(14,45,56,78))
name hour count visits
1: x 1 14 14
2: x 2 45 45
3: x 3 56 56
4: x 4 78 78
A few options:
1) using fcoalesce (the column to keep goes first, since the first non-NA value wins):
DT[, count := fcoalesce(count, visits)]
2) using is.na:
DT[is.na(count), count := visits]
3) using fifelse:
DT[, count := fifelse(is.na(count), visits, count)]
4) using set and using sindri_baldur's comment on [[ for faster indexing:
ix <- DT[is.na(count), which=TRUE]
set(DT, ix, "count", DT[["visits"]][ix])
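Argument order matters for fcoalesce, because it returns the first non-missing value at each position. The small sketch below uses plain vectors rather than the DT from the question, and the visits values are deliberately altered (a hypothetical, so the two orders give different results).

```r
library(data.table)

count  <- c(NA, 45, 56, 78)
visits <- c(14, 40, 56, 78)   # hypothetical: row 2 differs from count

# First non-NA value per position, scanning the arguments left to right
keep_count  <- fcoalesce(count, visits)   # fills only the NA in count
keep_visits <- fcoalesce(visits, count)   # visits has no NA, so count is ignored
```

Here keep_count preserves the existing 45 in the second position, whereas keep_visits is simply a copy of visits.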
Solution using data.table:
DT[is.na(count), count:=visits]
DT
Returns:
name hour count visits
1: x 1 14 14
2: x 2 45 45
3: x 3 56 56
4: x 4 78 78
Some base R solutions
using ifelse
DT <- within(DT, count <- ifelse(is.na(count),visits,count))
using rowSums
DT <- within(DT, count <- rowSums(cbind(is.na(count)*visits,count),na.rm = TRUE))
And here is a dplyr version to be complete for other users:
library(dplyr)
DT %>%
mutate(count = if_else(is.na(count), visits, count))
name hour count visits
1 x 1 14 14
2 x 2 45 45
3 x 3 56 56
4 x 4 78 78

Gathering columns from wide to long by id

I've got a data frame like this:
set.seed(100)
drugs <- data.frame(id = 1:5,
drug_1 = letters[1:5], drug_dos_1 = sample(100,5),
drug_2 = letters[3:7], drug_dos_2 = sample(100,5)
)
id drug_1 drug_dos_1 drug_2 drug_dos_2
1 a 31 c 49
2 b 26 d 81
3 c 55 e 37
4 d 6 f 54
5 e 45 g 17
I'd like to transform this messy table into a tidy table with all drugs of an id in one column and the corresponding drug dosages in one column. The table should look like this in the end:
id drug dosage
1 a 31
1 c 49
2 b 26
2 d 81
etc
I guess this could be achieved with a reshaping function that transforms my data from wide to long format, but I didn't manage to get it working.
One option is melt from data.table which can take multiple patterns in the measure argument
library(data.table)
melt(setDT(drugs), measure = patterns('^drug_\\d+$', 'dos'),
value.name = c('drug', 'dosage'))[, variable := NULL][order(id)]
# id drug dosage
#1: 1 a 31
#2: 1 c 49
#3: 2 b 26
#4: 2 d 81
#5: 3 c 55
#6: 3 e 37
#7: 4 d 6
#8: 4 f 54
#9: 5 e 45
#10: 5 g 17
Here, 'drug' is common to all the column names, so we need a pattern that matches only the drug-name columns. One way is to specify the start of the string (^) followed by the 'drug' substring, then an underscore (_) and one or more digits (\\d+) at the end ($) of the string. For the dosage columns, just use the 'dos' substring, which matches only the column names that contain it.
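To see which columns each pattern picks up, the two regular expressions can be tried directly on the column names with base R's grep. This is only a check of the patterns; the `cols` vector is typed in by hand from the question's data frame.

```r
cols <- c("id", "drug_1", "drug_dos_1", "drug_2", "drug_dos_2")

# '^drug_\\d+$': "drug_" at the start, then only digits up to the end
drug_cols <- grep("^drug_\\d+$", cols, value = TRUE)

# 'dos': any name containing that substring
dose_cols <- grep("dos", cols, value = TRUE)
```

A bare 'drug' pattern would have matched all four measurement columns, which is why the anchored version is needed.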
library(dplyr)
library(tidyr)  # gather() and spread() come from tidyr
drugs %>% gather(key, val, -id) %>%
  mutate(key = gsub('_\\d', '', key)) %>%   # strip the trailing _1 / _2
  mutate(key = gsub('drug_', '', key)) %>%  # turn drug_dos into dos
  group_by(key) %>%                         # number the rows within each key
  mutate(row = row_number()) %>% spread(key, val) %>%
  select(id, drug, dos, -row)
# A tibble: 10 x 3
id drug dos
<int> <chr> <chr>
1 1 a 31
2 1 c 49
3 2 b 26
4 2 d 81
5 3 c 55
6 3 e 37
7 4 d 6
8 4 f 54
9 5 e 45
10 5 g 17
Warning message:
attributes are not identical across measure variables;
they will be dropped
#This warning is generated because drug (chr) and dose (num) were merged into one column (val)
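The type mixing can be sidestepped by never putting the drug names and the dosages into a single column. A minimal base R sketch of that idea follows; the values are copied from the table in the question rather than regenerated with sample(), and the name `long` is just illustrative.

```r
drugs <- data.frame(id = 1:5,
                    drug_1 = letters[1:5], drug_dos_1 = c(31, 26, 55, 6, 45),
                    drug_2 = letters[3:7], drug_dos_2 = c(49, 81, 37, 54, 17),
                    stringsAsFactors = FALSE)

# Stack the two column pairs; drug stays character, dosage stays numeric
long <- data.frame(id     = rep(drugs$id, times = 2),
                   drug   = c(drugs$drug_1, drugs$drug_2),
                   dosage = c(drugs$drug_dos_1, drugs$drug_dos_2),
                   stringsAsFactors = FALSE)

# Sort so that both drugs of an id sit next to each other
long <- long[order(long$id), ]
rownames(long) <- NULL
```

This only scales comfortably to a handful of column pairs; for many pairs the melt approach above is less repetitive.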

Add data from a data table to another using values of a column

I know the question is confusing, but I hope the example will make it simple.
I have two tables:
x y
1 23
2 34
3 76
4 31
&
x y
1 78
3 51
5 54
I need to add the y columns based on x values. I can do it using loops, but don't want to. It will be better if the solution uses base, dplyr, data.table functions as I am most familiar with those, I am okay with apply family of functions as well. The output should look like this:
x y
1 101
2 34
3 127
4 31
5 54
The basic idea is to combine the two datasets, group by x, and summarize y with sum. There are a couple of ways to do it:
data.table:
rbind(dtt1, dtt2)[, .(y = sum(y)), by = x]
# x y
# 1: 1 101
# 2: 2 34
# 3: 3 127
# 4: 4 31
# 5: 5 54
base R aggregate:
aggregate(y ~ x, rbind(dtt1, dtt2), FUN = sum)
dplyr:
rbind(dtt1, dtt2) %>% group_by(x) %>% summarize(y = sum(y))
The data:
library(data.table)
dtt1 <- fread('x y
1 23
2 34
3 76
4 31')
dtt2 <- fread('x y
1 78
3 51
5 54')
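A different route to the same result, sketched here in base R as an alternative rather than as one of the answers above, is a full outer join on x followed by a row-wise sum. The names t1, t2, m and res are illustrative.

```r
t1 <- data.frame(x = 1:4, y = c(23, 34, 76, 31))
t2 <- data.frame(x = c(1, 3, 5), y = c(78, 51, 54))

# Full outer join on x; the clashing y columns become y.x and y.y
m <- merge(t1, t2, by = "x", all = TRUE)

# Treat a missing y as 0 when adding the two sides
res <- data.frame(x = m$x, y = rowSums(m[, c("y.x", "y.y")], na.rm = TRUE))
```

The rbind-and-aggregate versions generalize more easily to more than two tables, since the join approach needs one merge per extra table.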

perform operations on a data frame based on a factors

I'm having a hard time describing this (as the poor question title probably shows), so it's best explained with an example.
Using dplyr's group_by and summarize, I have produced a data frame on which I want to do some further manipulation by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I inserted the fractions in the percent column by hand to show the operation I want to perform.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table where you get a total for each run of df :
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1, 2, ..., n, then this will work:
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
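Both look-ups above index by the factor's integer codes, so they depend on the rows of the look-up table being in the same order as the factor levels. When that cannot be guaranteed, matching on the run labels is a safer variant. The sketch below assumes a total table whose rows are deliberately reversed to show that the order no longer matters; `idx` is an illustrative name.

```r
df <- data.frame(run = factor(c(rep(1, 3), rep(2, 3))),
                 group = factor(rep(c("a", "b", "c"), 2)),
                 sum = c(1, 8, 34, 2, 7, 33))

# Look-up table with its rows in the "wrong" order (run 2 first)
total <- data.frame(run = factor(c(2, 1)), total = c(47, 45))

# For each row of df, find the row of total with the same run label
idx <- match(as.character(df$run), as.character(total$run))

df$percent <- df$sum / total$total[idx]
```

A run that is missing from total yields NA in idx, and therefore NA in percent, instead of silently picking the wrong divisor.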
First, merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate (%<>% is magrittr's assignment pipe, so load magrittr as well as dplyr):
library(dplyr)
library(magrittr)
df2 %<>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

R self reference

In R I find myself doing something like this a lot:
adataframe[adataframe$col==something]<-adataframe[adataframe$col==something]+1
This way is kind of long and tedious. Is there some way for me
to reference the object I am trying to change such as
adataframe[adataframe$col==something]<-$self+1
?
Try package data.table and its := operator. It's very fast and very short.
DT[col1==something, col2:=col3+1]
The first part, col1==something, is the subset. You can put anything here and use the column names as if they were variables; i.e., there is no need to use $. The second part, col2:=col3+1, assigns the RHS to the LHS within that subset, and again the column names can be used as if they were variables. := is assignment by reference. No copies of any object are taken, so it is faster than <-, =, within and transform.
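A small runnable illustration of that call, with a made-up three-row table (the values and the filter value "a" are only for demonstration):

```r
library(data.table)

DT <- data.table(col1 = c("a", "b", "a"),
                 col2 = c(0, 0, 0),
                 col3 = c(1, 2, 3))

# Rows where col1 == "a" get col2 overwritten in place; other rows are untouched
DT[col1 == "a", col2 := col3 + 1]
```

Note that there is no assignment back to DT: the table is modified by reference.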
Also, soon to be implemented in v1.8.1, one end goal of j's syntax allowing := in j like that is combining it with by, see question: when should I use the := operator in data.table.
UPDATE: That was indeed released (:= by group) in July 2012.
You should be paying more attention to Gabor Grothendieck (and not just in this instance). The cited inc function on Matt Asher's blog does all of what you are asking:
(And the obvious extension works as well.)
add <- function(x, inc=1) {
eval.parent(substitute(x <- x + inc))
}
# Testing the `inc` function behavior
EDIT: After my temporary annoyance at the lack of approval in the first comment, I took up the challenge of adding yet another function argument. Supplied with a single argument, a portion of a dataframe, it still increments that range of values by one. Up to this point it has only been very lightly tested on infix dyadic operators, but I see no reason it wouldn't work with any function that accepts only two arguments:
transfn <- function(x, func="+", inc=1) {
eval.parent(substitute(x <- do.call(func, list(x , inc)))) }
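A short usage sketch for transfn, repeating the definition so the snippet runs on its own. The vectors v and w are illustrative and not from the original answer.

```r
transfn <- function(x, func = "+", inc = 1) {
  # Build the call `x <- do.call(func, list(x, inc))` and run it in the caller
  eval.parent(substitute(x <- do.call(func, list(x, inc))))
}

v <- c(1, 2, 3)
transfn(v)            # defaults: equivalent to v <- v + 1

w <- c(1, 2, 3)
transfn(w, "*", 2)    # equivalent to w <- w * 2
```

Because the assignment is evaluated in the calling frame, v and w change even though the function's return value is never assigned.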
(Guilty admission: This somehow "feels wrong" from the traditional R perspective of returning values for assignment.) The earlier testing on the inc function is below:
df <- data.frame(a1 =1:10, a2=21:30, b=1:2)
inc <- function(x) {
eval.parent(substitute(x <- x + 1))
}
#---- examples===============>
> inc(df$a1) # works on whole columns
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 6 25 1
6 7 26 2
7 8 27 1
8 9 28 2
9 10 29 1
10 11 30 2
> inc(df$a1[df$a1>5]) # testing on a restricted range of one column
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 7 25 1
6 8 26 2
7 9 27 1
8 10 28 2
9 11 29 1
10 12 30 2
> inc(df[ df$a1>5, ]) #testing on a range of rows for all columns being transformed
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
# and even in selected rows and grepped names of columns meeting a criterion
> inc(df[ df$a1 <= 3, grep("a", names(df)) ])
> df
a1 a2 b
1 3 22 1
2 4 23 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
Here is what you can do. Let us say you have a dataframe
df = data.frame(x = 1:10, y = rnorm(10))
And you want to increment all the y by 1. You can do this easily by using transform
df = transform(df, y = y + 1)
I'd be partial to (presumably the subset is on rows)
ridx <- adataframe$col==something
adataframe[ridx,] <- adataframe[ridx,] + 1
which doesn't rely on any fancy / fragile parsing, is reasonably expressive about the operation being performed, and is not too verbose. Also tends to break lines into nicely human-parse-able units, and there is something appealing about using standard idioms -- R's vocabulary and idiosyncrasies are already large enough for my taste.
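A concrete version of that idiom, as a sketch on a made-up data frame. It narrows the update to one numeric column, because adding 1 to whole rows does not work once the frame also holds non-numeric columns such as `col`. The column name `val` and the value of `something` are assumptions for illustration.

```r
adataframe <- data.frame(col = c("a", "b", "a"), val = c(1, 2, 3),
                         stringsAsFactors = FALSE)
something <- "a"

# which() drops NA comparisons, so rows with a missing `col` are left alone
ridx <- which(adataframe$col == something)

# Increment only the numeric column for the selected rows
adataframe$val[ridx] <- adataframe$val[ridx] + 1
```

Storing the index once in ridx is what removes most of the repetition the question complains about.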
