I'm using the epiR package as it does nice 2 by 2 contingency tables with odds ratios, and population attributable fractions.
As is common my data is coded
0 = No
1 = Yes
So when I do
table(var_1, var_2)
The output comes out as a table aligned like
For its input, though, epiR wants the top-left cell to be Exposed+VE, Outcome+VE, i.e. the top-left cell should be var_1 == 1 and var_2 == 1.
Currently I do this by recoding the zeroes to 2, or alternatively by converting to a factor and using relevel(). Both are slightly annoying for other analyses, as in general I want Outcome+VE to come after Outcome-VE.
So I wondered if there is an easy way (perhaps within table itself) to flip the orientation of the table so that it inverts the ordering of the rows and columns?
Hope the above makes sense - happy to provide clarification if not.
Edit: Thanks for suggestions below; just for clarification I want to be able to do this when calling table from existing dataframe variable - i.e when what I am doing is table(data$var_1, data$var_2) - ideally without having to create a whole new object
A two-way table is essentially a matrix, so you can simply index its rows and columns in reverse order.
xy <- table(data.frame(value = rbinom(100, size = 1, prob = 0.5),
variable = letters[1:2]))
variable
value a b
0 20 22
1 30 28
xy[2:1, 2:1]
variable
value b a
1 28 30
0 22 20
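To address the edit in the question, the same reverse indexing can be applied directly to the result of table() in one expression. This is a minimal sketch; the small `data` frame below is made up for illustration and only stands in for the asker's real data.

```r
# Hypothetical stand-in for the asker's data frame
data <- data.frame(var_1 = c(0, 0, 1, 1, 1, 0),
                   var_2 = c(0, 1, 1, 1, 0, 0))

# Build the table and flip both dimensions in a single call,
# so the top-left cell becomes var_1 == 1 and var_2 == 1
flipped <- table(data$var_1, data$var_2)[2:1, 2:1]
flipped
```

The original columns keep their 0/1 coding, so other analyses are unaffected.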
Using factor levels:
# reproducible example (adapted from Roman's answer)
df1 <- data.frame(value = rbinom(100, size = 1, prob = 0.5),
variable = letters[1:2])
table(df1)
# variable
# value a b
# 0 32 23
# 1 18 27
#convert to factor, specify levels
df1$value <- factor(df1$value, levels = c("1", "0"))
df1$variable <- factor(df1$variable, levels = c("b", "a"))
table(df1)
# variable
# value b a
# 1 27 18
# 0 23 32
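If the stored columns should stay as they are, the level order can probably be set inside the table() call itself. A sketch under that assumption, again with a made-up `data` frame:

```r
# Hypothetical stand-in for the asker's data frame
data <- data.frame(var_1 = c(0, 0, 1, 1, 1, 0),
                   var_2 = c(0, 1, 1, 1, 0, 0))

# Reverse the level order only inside the call; `data` itself is untouched
flipped_lv <- table(factor(data$var_1, levels = c(1, 0)),
                    factor(data$var_2, levels = c(1, 0)))
flipped_lv
```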
I have this data set:
ID Type Frequency
1 A 0.136546185
2 A 0.228915663
3 B 0.006024096
4 C 0.008032129
I want to create a new column that changes the Frequency values less than 0.01 into "other" and keeps the other information as it is, like this:
ID Type Frequency New_Frequency
1 A 0.136546185 0.136546185
2 A 0.228915663 0.228915663
3 B 0.006024096 other
4 C 0.008032129 other
I used mutate, but I don't know how to keep the original Frequency values that are 0.01 or bigger.
Can you please help me?
You can't achieve exactly what you want in R, because a vector cannot mix character and numeric values. If you are willing to convert everything to character, the other answers will work. If you want to keep the column numeric, you need to use NA rather than "other". You can also try the labelled package, which allows something like SPSS labels or SAS formats on numeric data.
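A minimal sketch of the NA variant, using the four rows from the question and the 0.01 cutoff that the other answers use:

```r
d <- data.frame(ID = 1:4,
                Type = c("A", "A", "B", "C"),
                Frequency = c(0.136546185, 0.228915663, 0.006024096, 0.008032129))

# Small frequencies become NA, so the new column stays numeric
d$New_Frequency <- ifelse(d$Frequency < 0.01, NA_real_, d$Frequency)
str(d$New_Frequency)
```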
Using mutate():
library(dplyr)
d <- tibble(ID = 1:4,
Type = c("A", "A", "B", "C"),
Frequency = c(0.136546185, 0.228915663, 0.006024096, 0.008032129))
d %>%
mutate(New_Frequency = case_when(Frequency < .01 ~ "other",
TRUE ~ as.character(Frequency)))
You can use ifelse
transform(df, Frequency = ifelse(Frequency < 0.01, 'Other', Frequency))
# ID Type Frequency
#1 1 A 0.136546185
#2 2 A 0.228915663
#3 3 B Other
#4 4 C Other
Note that the Frequency column is now character, since a column can hold data of only one type.
I have a data set involving 100 people and their diagnosis of 5 medical conditions. Any combinations of conditions can occur, but I've set it up so that the probability of condition D depends on condition A, and E depends on B.
set.seed(14)
numpeople <- 100
diagnoses <- data.frame(A=rbinom(100, 1, .15),
B=rbinom(100, 1, .1),
C=rbinom(100, 1, .2)
)
# Probability of diagnosis for D increases by .4 if patient has A, otherwise .2
diagnoses$D <- sapply(diagnoses$A, function(x) rbinom(1, 1, .4*x+.2))
# Probability of diagnosis for E increases by .7 if patient has B, otherwise .1
diagnoses$E <- sapply(diagnoses$B, function(x) rbinom(1, 1, .7*x+.1))
To make a co-occurrence matrix, where each cell is the number of people with both of the diagnoses in the row and column, I use matrix algebra:
diagnoses.dist <- t(as.matrix(diagnoses))%*%as.matrix(diagnoses)
diag(diagnoses.dist) <- 0
diagnoses.dist
> diagnoses.dist
A B C D E
A 0 1 1 11 3
B 1 0 0 1 7
C 1 0 0 5 4
D 11 1 5 0 4
E 3 7 4 4 0
Then I'd like to use a chord diagram to show the proportion of co-diagnoses for each diagnosis.
library(circlize)
circos.clear()
circos.par(gap.after=10)
chordDiagram(diagnoses.dist, symmetric=TRUE)
By default, size of the sector (pie slice) allocated for each group is proportional to the number of links.
> colSums(diagnoses.dist) #Number of links related to each diagnosis
A B C D E
16 9 10 21 18
Is it possible to set the sector width to illustrate the number of people with each diagnosis?
> colSums(diagnoses) #Number of people with each diagnosis
A B C D E
16 8 20 29 18
This problem seems somewhat related to section 14.5 of the circlize book, but I'm not sure how to work the math for the gap.after argument.
Based on section 2.3 of the circlize book, I tried setting the sector size using circos.initialize, but I think the chordDiagram function overrides this, because the scale on the outside is exactly the same.
circos.clear()
circos.par(gap.after=10)
circos.initialize(factors=names(diagnoses), x=colSums(diagnoses)/sum(diagnoses), xlim=c(0,1))
chordDiagram(diagnoses.dist, symmetric=TRUE)
I see a lot of options to fine-tune tracks in chordDiagram but not much for sectors. Is there a way this can be done?
In your case, the number of people in a category can sometimes be smaller than its total number of co-occurrences with other categories. For example, category B has 9 co-occurrences in total, but only 8 people.
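A toy illustration of why this happens (the three patients below are invented, not taken from the question's data): a person with several diagnoses contributes one link per pair, so links can outnumber people.

```r
# Three hypothetical patients; patient 1 has B, D and E at the same time
toy <- data.frame(B = c(1, 0, 0), D = c(1, 1, 0), E = c(1, 0, 1))

co <- crossprod(as.matrix(toy))  # same as t(X) %*% X
diag(co) <- 0

n_links  <- colSums(co)   # co-occurrence links per diagnosis
n_people <- colSums(toy)  # people per diagnosis

n_links["B"]   # 2 links (B with D, B with E)
n_people["B"]  # but only 1 person
```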
If this is not a problem, you can put values on the diagonal of the matrix that correspond to the number of people who fall in only one category. In the following example code, I just add random numbers to the diagonal to illustrate the idea:
diagnoses.dist <- t(as.matrix(diagnoses))%*%as.matrix(diagnoses)
diag(diagnoses.dist) = sample(10, 5)
# since the matrix is symmetric, we set the upper triangle to zero.
# we don't use `symmetric = TRUE` here because the values on the diagonal
# are still used.
diagnoses.dist[upper.tri(diagnoses.dist)] = 0
par(mfrow = c(1, 2))
# here you can remove `self.link = 1` to see the difference
chordDiagram(diagnoses.dist, grid.col = 2:6, self.link = 1)
# If you don't want to see the "mountains"
visible = matrix(TRUE, nrow = nrow(diagnoses.dist), ncol = ncol(diagnoses.dist))
diag(visible) = FALSE
chordDiagram(diagnoses.dist, grid.col = 2:6, self.link = 1, link.visible = visible)
PS: link.visible option is only available in recent versions of circlize.
I am trying to check the value of one variable and if it meets a certain condition the new variable gets set to 1 or else it gets set to zero.
I am having difficulty with this in R.
This simple code does not work:
attach(data)
if (Drug = 1) {
Drug_factor <- 0
} else {
if (Drug = 2) {
Drug_factor <- 1
} else Drug_factor<- 0
I do not understand why this will not work.
Why does R use such complicated conventions for doing basic stuff ?
You can either use ifelse
Data$Drug_factor <- with(Data, ifelse(Drug==1, 0, 1))
Or use the factor approach
Data$Drug_factor <- with(Data, as.numeric(as.character(factor(Drug,
levels=1:2, labels=0:1))))
Or
Data$Drug_factor <- c(0,1)[(Data$Drug==2)+1]
Or even shorter assuming that the 'Drug' is 'numeric'
Data$Drug_factor <- c(0,1)[Data$Drug]
All these cases, assume that there are only two unique elements in 'Drug'.
If 'Drug' has more than two unique values, the code in the question suggests that only 'Drug == 2' should return 1. To illustrate, add a third value to 'Drug':
Data$Drug[4] <- 3
In this case, we can change the ifelse condition such that when 'Drug' is 2 return 1 and for all others to return 0.
Data$Drug_factor <- with(Data, ifelse(Drug==2, 1, 0))
A similar option by indexing is,
Data$Drug_factor <- c(0,1)[(Data$Drug==2)+1]
data
set.seed(24)
Data <- data.frame(Drug= sample(1:2, 10, replace=TRUE), val=rnorm(10))
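For reference, here is the example data combined with two of the approaches above in one runnable piece. Note that comparisons use `==`; the single `=` in the question is not a comparison operator, and `if` tests one condition rather than working element by element, which is why the vectorised ifelse() is used here. The column name `Drug_index` is only introduced for this comparison.

```r
set.seed(24)
Data <- data.frame(Drug = sample(1:2, 10, replace = TRUE), val = rnorm(10))

# 1 when Drug is 2, otherwise 0, evaluated for every row at once
Data$Drug_factor <- ifelse(Data$Drug == 2, 1, 0)

# Same result by indexing: FALSE + 1 picks 0, TRUE + 1 picks 1
Data$Drug_index <- c(0, 1)[(Data$Drug == 2) + 1]

identical(Data$Drug_factor, Data$Drug_index)
```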
There are two different kinds of problems of this kind.
In the simple case, you want to change a small number of values to some other value. For this purpose, I find that using mapvalues() from plyr is a good solution. For example:
# let's pretend we have loaded some data where missing data is coded as 99
set.seed(1) # reproducible results
test_data = sample(c(0:5, 99), size = 1000, replace = TRUE)
# table of our data
table(test_data)
Output:
test_data
0 1 2 3 4 5 99
138 145 150 150 127 142 148
Recode:
#recode 99 to NA
library(plyr)
test_data_noNA = mapvalues(test_data, 99, NA)
table(test_data_noNA, exclude = NULL) #also count NAs
Output:
test_data_noNA
0 1 2 3 4 5 <NA>
138 145 150 150 127 142 148
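If you would rather not load plyr for a single value, logical indexing in base R does the same recode. A small sketch on a made-up vector:

```r
x <- c(0, 3, 99, 5, 99)

# Replace every 99 with NA in place
x[x == 99] <- NA
table(x, exclude = NULL)  # also count NAs
```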
In the other case, you want to conditionally change values to some other value, but there is a large/indefinite/infinite number of values it could be.
Example:
#continuous data
set.seed(1) #reproducible results
test_data = rnorm(1000) #normally distributed data
hist(test_data) #plot with histogram
Now let's say we want to deal with outliers, which we define as values more than 2 SD from the mean (the data here have a mean of roughly 0 and an SD of roughly 1, so beyond about ±2). We don't just want to exclude them, so instead we recode them to the cutoff.
#change values above 2 to 2
test_data[test_data > 2] = 2
#change values below -2 to -2
test_data[test_data < -2] = -2
hist(test_data) #plot with histogram
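The two assignments can also be written as one expression with pmin() and pmax(), which leaves the original vector untouched. A sketch on a short made-up vector:

```r
x <- c(-3.5, -1, 0, 2.7)

# pmax() raises anything below -2 to -2, pmin() lowers anything above 2 to 2
clamped <- pmin(pmax(x, -2), 2)
clamped
```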
I'd like to create a variable that bins values from another variable based on a binwidth
The data would look something like this if I wanted to create a bin variable based on counts where:
1 to 5 = 1
6 to 10 = 2
11 to 15 = 3
Without hand recoding each bin is there a function that can do something like this in R?
Since it looks like you want to get a numeric rather than a factor result, try something like trunc((mydata$count-1)/5)+1
e.g.
mydata$bucket = trunc((mydata$count-1)/5)+1
There's also the ceiling function, which is a little simpler:
mydata$bucket = ceiling(mydata$count/5)
see ?round
So on your data:
mydata = data.frame(spend=c(21,32,34,43,36,39,33,47,47,47,25,50,44,44) ,
count=c(3L,1L,2L,15L,1L,8L,1L,11L,15L,11L,3L,12L,11L,4L) )
mydata$bucket = ceiling(mydata$count/5)
Which gives:
> mydata
spend count bucket
1 21 3 1
2 32 1 1
3 34 2 1
4 43 15 3
5 36 1 1
6 39 8 2
7 33 1 1
8 47 11 3
9 47 15 3
10 47 11 3
11 25 3 1
12 50 12 3
13 44 11 3
14 44 4 1
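As a quick sanity check, the trunc() and ceiling() versions should agree for positive whole-number counts; this sketch compares them over the range 1 to 15 used in the question:

```r
count <- 1:15

bucket_trunc   <- trunc((count - 1) / 5) + 1
bucket_ceiling <- ceiling(count / 5)

all(bucket_trunc == bucket_ceiling)
table(bucket_ceiling)  # five counts fall in each bucket
```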
Yes, it's called the cut function:
?cut
You can use the generic cut() function. For a numeric vector x, the method has these arguments:
> args(cut.default)
function (x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE,
dig.lab = 3L, ordered_result = FALSE, ...)
The argument breaks is central here. It is either a number of intervals or a vector of “breakpoints” defining your intervals. Note that all intervals are by default left-open and right-closed (right = TRUE), so by creating an object x containing the numbers from 1 to 100 and defining a vector of breakpoints (brk) {1, 20, 50, 100}, you will get these results (after using table() on the result):
> x <- 1:100
> brk <- c(1,20,50,100)
> table(cut(x = x, breaks = brk))
(1,20] (20,50] (50,100]
19 30 50
You can see that the first interval is $(1,\,20]$, so 1 is not part of it and the first observation will become a missing value NA (as all other observations outside the defined intervals).
By setting include.lowest = TRUE, R includes the lowest value (i.e., the first interval will be closed), so I think this will produce what you want:
> x <- 1:100
> brk <- c(1,20,50,100)
> table(cut(x = x, breaks = brk, include.lowest = TRUE))
[1,20] (20,50] (50,100]
20 30 50
Setting right = FALSE reverses the whole process: intervals become left-closed and right-open, and include.lowest = TRUE then closes the last interval instead (i.e., it includes the highest value in the last category).
As the resulting object will be of class "factor", you might consider setting ordered_result to TRUE, producing an ordered factor object (classes "ordered" and "factor").
Labelling, etc. is optional (see ?cut).
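To see the effect of the right argument, the same x and brk can be cut with right = FALSE, once without and once with include.lowest = TRUE:

```r
x <- 1:100
brk <- c(1, 20, 50, 100)

# Left-closed intervals: 100 falls outside [50,100) and becomes NA
left_closed <- table(cut(x, breaks = brk, right = FALSE))

# include.lowest = TRUE now closes the LAST interval, so 100 is kept
left_closed_all <- table(cut(x, breaks = brk, right = FALSE,
                             include.lowest = TRUE))

left_closed
left_closed_all
```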
The cut function can also bin a variable and return plain integer bin codes rather than a factor; you just need to set the labels parameter to FALSE:
myData$bucket <- cut(myData$counts, breaks = 30, labels = FALSE)
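For the 1 to 5, 6 to 10, 11 to 15 bins in the question, explicit breakpoints may be clearer than a number of intervals. A sketch using the count values from the earlier answer; the breakpoints c(0, 5, 10, 15) are my assumption about the intended bin edges:

```r
count <- c(3L, 1L, 2L, 15L, 1L, 8L, 1L, 11L, 15L, 11L, 3L, 12L, 11L, 4L)

# (0,5] -> 1, (5,10] -> 2, (10,15] -> 3; labels = FALSE returns integer codes
bucket <- cut(count, breaks = c(0, 5, 10, 15), labels = FALSE)
bucket
```

With these breakpoints the result matches the ceiling(count / 5) approach.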
I have a column of numbers that I would like to split into multiple separate vectors.
Creating a list of data frames using split() doesn't work for my later purposes.
df <- as.data.frame(runif(n = 10000, min = 1, max = 10))
Here split() creates a list of data frames, which I can't use for my later steps; I need separate vectors listed under "Values".
map.split <- split(df, (as.numeric(rownames(df)) - 1) %/% 250) # this is not the trick
My goal is to split the column into separate vectors (listed under "Values" in the Global Environment pane, not under "Data").
This would be the slow way:
VecList1 <- df[1:250,]
VecList2 <- df[251:500,]
with
str(VecList1)
Int [1:250] 1 1 10 5 3 ....
Any advice welcome
If I'm interpreting correctly (not clear to me), here's a reduced problem and what I think you're asking for.
set.seed(2)
df <- data.frame(x = runif(10, min = 1, max = 10))
df$Values <- (seq_len(nrow(df))-1) %/% 4
df
# x Values
# 1 2.663940 0
# 2 7.321366 0
# 3 6.159937 0
# 4 2.512467 0
# 5 9.494554 1
# 6 9.491275 1
# 7 2.162431 1
# 8 8.501039 1
# 9 5.212167 2
# 10 5.949854 2
If all you need is that Values column as its own object, then you can just change df$Values <- ... to Values <- ....
Here's one way of doing this (although it's probably better to figure out a way where you don't need a series of separate vectors, but rather work with columns in a single matrix):
df <- data.frame(a=runif(n = 10000, min = 1, max = 10))
mx<-matrix(df$a,nrow=250)
for (i in seq_len(ncol(mx))) {
  assign(paste0("VecList", i), mx[, i])
}
Note: using assign is generally not advisable. Whatever it is you're trying to achieve, there's probably a better way of doing it without creating a series of new vectors in the global environment.
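One possible alternative along those lines: calling split() on the column itself (rather than on the data frame) gives a list of plain numeric vectors, which can be indexed by position instead of by generated names. A sketch with a reduced example of 10 values in chunks of 4; the names `x`, `grp` and `chunks` are only illustrative:

```r
set.seed(2)
x <- runif(10, min = 1, max = 10)

# Chunk id for each element: 0,0,0,0,1,1,1,1,2,2
grp <- (seq_along(x) - 1) %/% 4

# A list of numeric vectors, one per chunk
chunks <- split(x, grp)
lengths(chunks)
str(chunks[[1]])
```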