Constructing numeric flag with switch command - r

I have a data frame with a variable ind_percentiles, which takes values from 1 to 100. I want to create a numeric categorical variable, ind_Q, which takes the values:
1 if ind_percentiles is 1 - 25
2 if ind_percentiles is 26 - 50
3 if ind_percentiles is 51 - 75
4 if ind_percentiles is 76 - 100
I want to accomplish this using a switch statement rather than if/else.
Data$IMD_Q = 0
switch(Data$Index_Q,
1 = {Data$ind_percentiles <= 25},
2 = {Data$ind_percentiles > 25},
3 = {Data$ind_percentiles > 50},
4 = {Data$ind_percentiles > 75})
Is this possible? How do I achieve this?
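For what it's worth, switch() evaluates a single value (its first argument must have length 1), so it cannot be applied to a whole column in one call. A minimal sketch of a vectorized alternative with cut(), assuming the boundaries are 25, 50 and 75; the data frame Data below is made up for illustration, and only the column name ind_percentiles comes from the question:

```r
# Hypothetical example data; only the column name comes from the question.
Data <- data.frame(ind_percentiles = c(1, 25, 26, 50, 51, 75, 76, 100))

# cut() bins the whole column at once. The intervals are right-closed:
# (0,25], (25,50], (50,75], (75,100]. labels = FALSE returns integer codes 1..4.
Data$ind_Q <- cut(Data$ind_percentiles,
                  breaks = c(0, 25, 50, 75, 100),
                  labels = FALSE)

print(Data)
```

With right-closed intervals, 25 falls in group 1 and 26 in group 2, which matches the ranges listed in the question.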


What type returns table in R?

I wrote the lines of code below.
I want to get the most frequent value in a matrix:
matrix7 <- matrix(sample(1:36, 100, replace = TRUE), nrow = 1)
t <- table(matrix7)
print(t)
a <- which.max(table(matrix7))
print(unlist(a))
It prints this:
> matrix7 <- matrix(sample(1:36, 100, replace = TRUE), nrow = 1)
> t <- table(matrix7)
> print(t)
matrix7
 1  2  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 25 26 27 28 29 30 31 32 34 35 36
 4  5  1  5  2  5  1  3  1  4  2  2  2  5  5  1  3  7  2  3  2  3  2  1  4  4  2  2  2  5  2  5  3
> a <- which.max(table(matrix7))
> print(unlist(a))
19
18
>
What type are my variables t and a, and how can I get the most frequent value from the matrix?
To know the "type" of a variable, use:
class(t)
class(a)
Note that t is already a table, because you created it with t <- table(matrix7), while your variable a is an integer.
To get the most common element, sort the table t in decreasing order; the first entry is the most frequent value:
sort(t, decreasing = TRUE)
In general, if you want to know the "type" (more properly called the class) of an object, use the function class:
> class(t)
[1] "table"
There are a few ways you can find the most frequent value. Given that you have already calculated the which.max, you can take the corresponding name of t:
> as.numeric(names(t)[a])
[1] 5 ## I have a different random number seed to you :)
Note that you can't just take t[a], since that returns the count stored at that position (with the value as its name), not the value itself.
In your example, the object a is an integer vector of length one. The "data" is 18, and it has the "name" 19. Hence another and perhaps simpler way to get the most frequent value is to take names(a).
You can use either class() to get the class attribute of an R object or typeof() to get its type or storage mode.
The class and type of a are both 'integer'; the class of t is 'table' and its type is 'integer'.
Note that a is a named integer, which is why two values are printed. names(a) returns only the value (as a character string).
If you use which.max(tabulate(matrix7)), it returns the value directly, with no further conversion needed.
which.max(tabulate(matrix7))
[1] 16
(Side note: since your code sets no seed, the results differ; you can set one with set.seed(x), where x is an integer.)
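To pull the points from these answers together, here is a small sketch on a fixed vector, so the result does not depend on a random seed. The names m, tab, most_frequent and most_frequent_tab are made up for this example:

```r
# Small made-up matrix so the result is deterministic (no random sampling).
m <- matrix(c(3L, 3L, 5L, 7L, 3L, 5L), nrow = 1)

tab <- table(m)       # counts per distinct value
a <- which.max(tab)   # named integer: position of the largest count

class(tab)            # "table"
typeof(tab)           # "integer"
class(a)              # "integer"

# The name of a holds the most frequent value (as a character string).
most_frequent <- as.numeric(names(a))

# tabulate() counts occurrences of 1, 2, ..., max(m), so the index is the value itself.
most_frequent_tab <- which.max(tabulate(m))
```

Both most_frequent and most_frequent_tab should come out as 3 here. The tabulate() route only works when the values are positive integers.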

Comparing each element in two columns and set another column

I have a data table (read with fread from a file) with two columns (dep and label). I want to set another column (mark) to an id value depending on a match. If a 'dep' entry matches a 'label' entry, mark gets the 'id' of the matched 'label'. If there is no match, mark gets the value of the row's own 'id'. Currently I have a workaround using loops, but I know there should be a neater, more idiomatic way to do it in R.
trace <- data.table(id=seq(1:7),dep=c(-1,45,40,47,0,45,43),
                    label=c(99,40,43,45,47,42,48), mark=rep("",7))
The expected result is:
   id dep label mark
1:  1  -1    99    1
2:  2  45    40    2
3:  3  40    43    2
4:  4  47    45    4
5:  5   0    47    5
6:  6  45    42    4
7:  7  43    48    3
I know loops are slow in R. Just to give an example, the following naive for/while loop works for small sizes, but my data set is huge.
trace$mark <- trace$id
for (i in seq_along(trace$id)) {
  val <- trace$dep[i]
  j <- 1
  while (j <= i && val != -1 && val != 0) {  # don't compare if val is -1 or 0
    if (val == trace$label[j]) {
      trace$mark[i] <- trace$id[j]
    }
    j <- j + 1
  }
}
I have also tried using the following approach but it works only if there is a single match.
match <- which(trace$dep %in% trace$label)
match_to <- which(trace$label %in% trace$dep)
trace$mark[match] <- trace$mark[match_to]
This solution might help:
trace[trace[,.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
#    id dep label mark
# 1:  1  -1    99    1
# 2:  2  45    40    4
# 3:  3  -1    43    3
# 4:  4  47    45    5
# 5:  5  -1    47    5
# 6:  6  45    42    4
# 7:  7  43    48    3
Update:
To make sure you are not matching dep with 0 or -1 values you can just add another line.
trace[dep %in% c(0,-1), mark:= as.character(id)]
OR
Try this:
trace[trace[!dep %in% c(0,-1),.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
The solution that worked:
trace[trace[,.(id,dep=label)],on=.(id<=id,dep),mark:=as.character(i.id),allow.cartesian=TRUE]
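For comparison, the loop from the question can also be vectorized in base R with match(). This is only a sketch under one assumption that the question does not state: every label is unique, so the first hit found by match() is the only one. It uses plain vectors instead of a data.table to stay self-contained:

```r
# Data from the question, as plain vectors.
id    <- 1:7
dep   <- c(-1, 45, 40, 47, 0, 45, 43)
label <- c(99, 40, 43, 45, 47, 42, 48)

# Row index of the label equal to each dep (NA when there is none).
# Assumption: labels are unique, so match() finding the first hit is enough.
j <- match(dep, label)

# A match counts only if it sits at or above the current row
# and dep is not one of the sentinel values -1 / 0.
ok <- !is.na(j) & j <= seq_along(dep) & !(dep %in% c(-1, 0))

# Take the id of the matched row where allowed, otherwise the row's own id.
mark <- ifelse(ok, id[j], id)
mark
```

On this data the sketch reproduces the mark column given as the expected result in the question. With duplicated labels it would need a different approach, since the loop keeps the last match and match() returns the first.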

r - lapply divides a column by an integer value from different dataset, unexpected result

I have two data.frames, one with genotype counts and one with the numbers I need to normalize the counts in the first dataset.
countsdata=data.frame(genotype1=rep(c(10,20,30,40),each=1),
                      genotype2=rep(c(100,200,300,400),each=1),
                      genotype3=rep(c(40,50,60,70),each=1),
                      genotype4=rep(c(40,50,60,70),each=1)
)
coldata = data.frame(Group =c('genotype1', 'genotype2', 'genotype3', 'genotype4'),
                     Treatment = rep(c("control","treated"),each = 2),
                     Norm=rep(c(1,2,5,5)))
I made sure my variables don't contain factors:
factorsCharacter <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
                                                     as.character))
coldata=factorsCharacter(coldata)
Then I see that lapply loops through my counts, one column at a time, and through my coldata, which contains the normalization value (Norm). All looks good until I combine the two actions in the same step:
> lapply(coldata['Group'],function(group_i){group_i})
$Group
[1] "genotype1" "genotype2" "genotype3" "genotype4"
> lapply(coldata['Group'],function(group_i){countsdata[,group_i]})
$Group
  genotype1 genotype2 genotype3 genotype4
1        10       100        40        40
2        20       200        50        50
3        30       300        60        60
4        40       400        70        70
> lapply(coldata['Group'],function(group_i){as.integer(coldata[coldata$Group==group_i,'Norm'])})
$Group
[1] 1 2 5 5
> lapply(coldata['Group'],function(group_i){
+ countsdata[,group_i]/as.integer(coldata[coldata$Group==group_i,'Norm'])
+ })
$Group
  genotype1 genotype2 genotype3 genotype4
1        10       100        40        40
2        10       100        25        25
3         6        60        12        12
4         8        80        14        14
Here the result is not what I was expecting (each column divided by its normalization number). After further inspection I noticed it is normalizing by rows; in other words, it is normalizing across different columns, which shouldn't be the case since I am looping through one column at a time. I am probably missing a basic concept, but I looked through other SO posts and didn't find anything I could use. My goal is to fix the code so it makes the right calculation, but I would also like to understand why the code above is not working. Thanks so much.
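The problem is using [ instead of [[. coldata['Group'] is a one-column data frame, i.e. a list of length 1, so lapply calls the function once with all four names instead of once per element. countsdata[, group_i] is then the whole data frame, and dividing it by the four Norm values recycles them down each column, which is why the result is normalized by rows. To loop over the individual elements, use coldata[, 'Group'], coldata[['Group']] or coldata$Group.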
countsdataNew <- countsdata
countsdataNew[] <- lapply(coldata[['Group']],function(group_i)
  countsdata[,group_i]/coldata$Norm[coldata$Group==group_i])
countsdataNew
#  genotype1 genotype2 genotype3 genotype4
#1        10        50         8         8
#2        20       100        10        10
#3        30       150        12        12
#4        40       200        14        14
If the column names in 'countsdata' and the 'Group' column from 'coldata' are in the same order, we can do this easily with Map
Map(`/`, countsdata, coldata$Norm)
Or just replicate the 'Norm' and do a simple division
countsdata/coldata$Norm[col(countsdata)]
Or with sweep
sweep(countsdata, 2, coldata$Norm, "/")
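The difference between [ and [[ is easy to check directly. A short sketch using the data from the question (rebuilt here without the Treatment column, which plays no role in the division); by_lapply and by_sweep are names made up for this example:

```r
countsdata <- data.frame(genotype1 = c(10, 20, 30, 40),
                         genotype2 = c(100, 200, 300, 400),
                         genotype3 = c(40, 50, 60, 70),
                         genotype4 = c(40, 50, 60, 70))
coldata <- data.frame(Group = paste0("genotype", 1:4),
                      Norm = c(1, 2, 5, 5),
                      stringsAsFactors = FALSE)

# `[` keeps the data frame wrapper: one list element holding all four names.
length(coldata["Group"])     # 1
# `[[` extracts the column itself: four separate elements to loop over.
length(coldata[["Group"]])   # 4

# Per-column division with lapply over the extracted names.
by_lapply <- countsdata
by_lapply[] <- lapply(coldata[["Group"]], function(group_i) {
  countsdata[, group_i] / coldata$Norm[coldata$Group == group_i]
})

# Same result with sweep(), assuming the columns are in the same order as coldata$Group.
by_sweep <- sweep(countsdata, 2, coldata$Norm, "/")
```

The lapply version looks up each Norm by name, so it does not depend on column order; the sweep version is shorter but relies on that ordering assumption.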

Row aggregation when values are close enough in a column

I have a data frame with 2 columns:
time          x
1306247226    5
1306247236   10
1306248127   20
1306248187   36
1306249248   28
1306249258   24
1306249259   20
...
I'd like to aggregate the rows whose values in the 'time' column are close enough (e.g. their difference is less than 60) and sum their 'x' values in the aggregated row. The 'time' value in the aggregated row will be that of the first row of the aggregation. ('time' is a Unix timestamp.)
The goal is to have as output of this example:
time          x
1306247226   15
1306248127   20
1306248187   36
1306249248   72
...
The dataset is quite big, a 'for' loop will take a long time... but if it is the only option I can deal with it and wait.
Any idea?
Thanks a lot!
You can use something like this.
First I create a new grouping column for the aggregation: a new group starts whenever the gap to the previous row exceeds 60.
dat$gg <- cumsum(c(0,diff(dat$time)) > 60)
Then I use the plyr package to aggregate by group:
library(plyr)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
  gg       time res
1  0 1306247226  15
2  1 1306248127  56
3  2 1306249248  72
Edit after comment:
The OP wanted differences of 60 or more to start a new group, not only differences greater than 60, so I need to change > to >=:
dat$gg <- cumsum(c(0,diff(dat$time)) >= 60)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
  gg       time res
1  0 1306247226  15
2  1 1306248127  20
3  2 1306248187  36
4  3 1306249248  72
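If plyr is not available, the same grouping idea can be sketched in base R with tapply(). This assumes, as the answer above does, that the rows are already sorted by time and that each row is compared with the row just before it; dat and res are names made up for this example:

```r
# The seven rows shown in the question.
dat <- data.frame(
  time = c(1306247226, 1306247236, 1306248127, 1306248187,
           1306249248, 1306249258, 1306249259),
  x = c(5, 10, 20, 36, 28, 24, 20)
)

# A new group starts whenever the gap to the previous row is 60 or more.
dat$gg <- cumsum(c(0, diff(dat$time)) >= 60)

# First time stamp and summed x per group.
res <- data.frame(
  time = as.vector(tapply(dat$time, dat$gg, function(v) v[1])),
  x    = as.vector(tapply(dat$x, dat$gg, sum))
)
res
```

Because only consecutive rows are compared, a long chain of closely spaced rows ends up in a single group even if its first and last time stamps are far apart.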

How to call the factor level of a variable for each observation, and use those values to create a new variable in R?

I have a dataset with a categorical variable hospital_code which has 10 levels.
The program that I am running loops through the data and takes a subset such that the variable compLbl contains exactly 2 of the 10 hospital_codes, so that they can be compared to each other. In each loop, I now need compLbl to be binary coded (1s and 0s).
If I just take the subset data from the first loop, in which the possible values for compLbl are AMH and BHC, I can easily do this as follows:
nData$compLbl2 = with(nData,(ifelse(compLbl == "AMH", 1,0)))
And get data that looks like this:
head(nData)
  compLbl outLbl Race_Code Age Complexity_Subclass_Code compLbl2
1     AMH      0         W  63                        1        1
2     AMH      0         W  44                        2        1
3     AMH      0         W  88                        3        1
4     BHC      0         W  64                        1        0
5     BHC      0         W  61                        2        0
6     BHC      0         W  61                        1        0
How can I generalize this so that no matter what two values are in compLbl it will binary code them? My thought was to possibly do this by referencing factor level 1 for whatever two values are present in the factor variable compLbl. Like this:
nData$compLbl2 = with(nData,(ifelse(FACTORLEVEL(compLbl) == 1, 1,0)))
Where in my above example FACTORLEVEL(compLbl) would return a 1 for AMH and a 2 for BHC since those are the factor levels that R would automatically assign. However, I'm not sure how to do this, or if it is possible.
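I would use this command. It assumes compLbl is a factor; drop = TRUE discards the unused levels, so the two remaining levels get the codes 1 and 2, and subtracting from 2 turns them into 1 and 0:
nData <- within(nData, compLbl2 <- 2 - as.numeric(compLbl[drop = TRUE]))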
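The role of the unused levels is easiest to see on a small example. In this sketch the hospital code CMH and the names hospital, sub, codes and flag are made up for illustration; only AMH and BHC appear in the question:

```r
# Made-up factor with more levels than any one subset uses.
hospital <- factor(c("AMH", "BHC", "CMH", "AMH", "BHC", "CMH"))

# One loop iteration: keep only two of the hospitals.
sub <- hospital[hospital %in% c("BHC", "CMH")]

# droplevels() removes the unused level "AMH", so the codes are 1 and 2.
codes <- as.integer(droplevels(sub))

# 1 for the first remaining level, 0 for the second.
flag <- as.numeric(codes == 1)
```

Without dropping the unused levels, BHC and CMH would keep their original codes 2 and 3, and a comparison against 1 would never be true.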
