Change values in data.frame conditionally - r

I am trying to check the value of one variable and if it meets a certain condition the new variable gets set to 1 or else it gets set to zero.
I am having difficulty with this in R.
This simple code does not work:
attach(data)
if (Drug = 1) {
Drug_factor <- 0
} else {
if (Drug = 2) {
Drug_factor <- 1
} else Drug_factor<- 0
I do not understand why this will not work.
Why does R use such complicated conventions for doing basic stuff ?

You can either use ifelse
Data$Drug_factor <- with(Data, ifelse(Drug==1, 0, 1))
Or use the factor approach
Data$Drug_factor <- with(Data, as.numeric(as.character(factor(Drug,
levels=1:2, labels=0:1))))
Or
Data$Drug_factor <- c(0,1)[(Data$Drug==2)+1]
Or even shorter assuming that the 'Drug' is 'numeric'
Data$Drug_factor <- c(0,1)[Data$Drug]
All of these approaches assume that 'Drug' has only two unique values.
If 'Drug' has more than two unique values, the code in the question suggests that only 'Drug == 2' should return 1. To illustrate, add a third value to 'Drug':
Data$Drug[4] <- 3
In this case, we can change the ifelse condition so that it returns 1 when 'Drug' is 2 and 0 for all other values.
Data$Drug_factor <- with(Data, ifelse(Drug==2, 1, 0))
A similar option using indexing is:
Data$Drug_factor <- c(0,1)[(Data$Drug==2)+1]
data
set.seed(24)
Data <- data.frame(Drug= sample(1:2, 10, replace=TRUE), val=rnorm(10))
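As for why the attempt in the question fails: inside if(), `=` is not a comparison (R reports a syntax error), and if() inspects a single value rather than a whole column. A minimal sketch of the vectorized equivalent, using a small made-up data frame (the values below are assumptions for illustration, not the asker's data):

```r
# Hypothetical data, not the asker's real data set
Data <- data.frame(Drug = c(1, 2, 2, 1, 3))

# `==` compares element-wise and returns TRUE/FALSE for every row;
# as.integer() converts TRUE to 1 and FALSE to 0
Data$Drug_factor <- as.integer(Data$Drug == 2)

Data$Drug_factor
# [1] 0 1 1 0 0
```

Because the comparison is applied to the whole column at once, no loop or nested if/else is needed.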

There are two different kinds of problems of this kind.
In the simple case, you want to change a small number of values to some other value. For this purpose, I find that using mapvalues() from plyr is a good solution. For example:
# let's pretend we have loaded some data where missing values are coded as 99
set.seed(1) # reproducible results
test_data = sample(c(0:5, 99), size = 1000, replace = TRUE)
# table of our data
table(test_data)
Output:
test_data
0 1 2 3 4 5 99
138 145 150 150 127 142 148
Recode:
#recode 99 to NA
library(plyr)
test_data_noNA = mapvalues(test_data, 99, NA)
table(test_data_noNA, exclude = NULL) #also count NAs
Output:
test_data_noNA
0 1 2 3 4 5 <NA>
138 145 150 150 127 142 148
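If you would rather not depend on plyr, the same kind of recode can be sketched in base R. The vector and the lookup table below are made up for illustration:

```r
# Hypothetical vector where 99 codes a missing value
x <- c(0, 3, 99, 5, 99, 1)

# Logical indexing: overwrite only the elements equal to 99
x_clean <- x
x_clean[x_clean == 99] <- NA

# Recoding several values at once with a named lookup vector;
# values without an entry in the lookup (here NA) come back as NA
codes <- c("0" = "none", "1" = "low", "3" = "mid", "5" = "high")
labels <- unname(codes[as.character(x_clean)])
```

The lookup-vector form is handy when there are more than a couple of values to map, since the mapping lives in one place.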
In the other case, you want to conditionally change values, but the number of distinct values to change is large, unknown, or infinite (for example, continuous data).
Example:
#continuous data
set.seed(1) #reproducible results
test_data = rnorm(1000) #normally distributed data
hist(test_data) #plot with histogram
Let's say we want to deal with outliers, which we define as values more than 2 SD from the mean (the data here are standard normal, so that is roughly beyond ±2). We don't want to exclude them, so instead we recode them.
# change values above 2 to 2
test_data[test_data > 2] = 2
# change values below -2 to -2
test_data[test_data < -2] = -2
hist(test_data) #plot with histogram
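The hard-coded cut-offs of 2 and -2 work because the data are standard normal. A hedged sketch of the same clamping with the limits computed from the data, using pmin() and pmax() (the variable names are illustrative):

```r
set.seed(1)          # reproducible results
x <- rnorm(1000)     # normally distributed data

# Cut-offs at 2 standard deviations either side of the mean
lower <- mean(x) - 2 * sd(x)
upper <- mean(x) + 2 * sd(x)

# pmax() raises anything below `lower`, pmin() lowers anything above `upper`
clamped <- pmin(pmax(x, lower), upper)

range(clamped)
```

Values already inside the limits are left untouched, so only the tails of the histogram change.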

Related

Index and assign multiple sets of rows at once

I have an imported dataframe Measurements that contains many observations from an experiment.
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
X Data
1 90
2 85
3 100
4 105
I want to add another column, Condition, that specifies the treatment group for each data point. I know which observation ranges are from which condition (e.g. observations 1:2 are from the control group and observations 3:4 are from the experimental group).
I have devised two solutions already that give the desired output but neither are ideal. First:
Measurements["Condition"] <- c(rep("Cont", 2), rep("Exp", 2))
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
The benefit of this is that it is a single command. But it is not ideal, because I have to work out the group sizes separately (e.g. 3:4 = 2 observations), which gets tricky and unclear with larger datasets and more conditions (e.g. 47:83 = ? observations). It is also error-prone: a small mistake in the length of an early group shifts the assignment of every later group (e.g. if rep for Cont is mistakenly 1, then Exp is wrongly assigned to rows 2:3).
I also thought of assigning like this, which gives the desired output too:
Measurements[1:2, "Condition"] <- "Cont"
Measurements[3:4, "Condition"] <- "Exp"
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
This makes it more clear/simple/direct which rows will receive which assignment, but this requires separate assignments and repetition. I feel like there should be a way to "vectorize" this assignment, which is the solution I'm looking for.
I'm having trouble finding documentation on complex indexing rules online. Here is my first intuitive guess at how to achieve this:
Measurements[c(1:2, 3:4), "Condition"] <- list("Cont", "Exp")
X Data Condition
1 90 Cont
2 85 Cont
3 100 Cont
4 105 Cont
But this doesn't work. It seems to combine 1:2 and 3:4 into a single equivalent range (1:4) and assigns only the first condition to this range, which suggests I also need to specify the column again. When I try to specify the column again:
Measurements[c(1:2, 3:4), c("Condition", "Condition")] <- list("Cont", "Exp")
X Data Condition Condition.1
1 90 Cont Exp
2 85 Cont Exp
3 100 Cont Exp
4 105 Cont Exp
For some reason this creates a second new column (??), and it again seems to combine 1:2 and 3:4 into essentially 1:4. So I think I need to index the two row ranges in a way that keeps them separate and only specify the column once, but I'm stuck on how to do this. I assume the solution is simple but I can't seem to find an example of what I'm trying to do. Maybe to keep them separate I do have to assign them separately, but I'm hoping there is a way.
Can anyone help? Thank you a ton in advance from an R noobie!
If you already have a list of observations which belong to each condition you could use dplyr::case_when to do a conditional mutate. Depending on how you have this information stored you could use something like the following:
library(dplyr)
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
# set which observations belong to each condition
Cont <- 1:2
Exp <- 3:4
Measurements %>%
mutate(Condition = case_when(
X %in% Cont ~ "Cont",
X %in% Exp ~ "Exp"
))
# X Data Condition
# 1 90 Cont
# 2 85 Cont
# 3 100 Exp
# 4 105 Exp
Note that this does not require the observations to be in consecutive rows.
I normally see this done with a merge operation. The trick is getting your conditions data into a nice shape.
composeConditions <- function(...) {
conditions <- list(...)
data.frame(
X = unname(unlist(conditions)),
condition = unlist(unname(lapply(
names(conditions),
function(x) rep(x, times = length(conditions[x][[1]]))
)))
)
}
conditions <- composeConditions(Cont = 1:2, Exp = 3:4)
> conditions
X condition
1 1 Cont
2 2 Cont
3 3 Exp
4 4 Exp
merge(Measurements, conditions, by = "X")
X Data condition
1 1 90 Cont
2 2 85 Cont
3 3 100 Exp
4 4 105 Exp
For larger datasets, an efficient approach is to keep a vector of condition labels and a pattern vector that gives, for each row, the index of its label.
Measurements <- data.frame(X = 1:4, Data = c(90, 85, 100, 105))
dat <- c("Cont","Exp")
pattern <- c(1,1,2,2)
Or derive the pattern from the data, e.g. conditionally on Measurements$Data:
pattern <- ifelse(Measurements$Data >= 100, 2, 1)
# [1] 1 1 2 2
Then you can add the data simply by doing:
Measurements$Condition <- dat[pattern]
# X Data Condition
#1 1 90 Cont
#2 2 85 Cont
#3 3 100 Exp
#4 4 105 Exp
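For the original goal of assigning several row ranges in one step, one possible base R sketch keeps the ranges in a named list and lets rep() expand the names. The list name `groups` is an assumption made for illustration:

```r
Measurements <- data.frame(X = 1:4,
                           Data = c(90, 85, 100, 105))

# Row ranges per condition, stated directly (hypothetical helper list)
groups <- list(Cont = 1:2, Exp = 3:4)

# Start with NA so rows not covered by any range stay visibly unassigned
Measurements$Condition <- NA_character_

# unlist() gives the row indices; rep() repeats each name once per row in its range
Measurements$Condition[unlist(groups)] <- rep(names(groups), lengths(groups))

Measurements
```

The ranges are written once and the group sizes are computed by lengths(), so a mistake in one range does not shift the others.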

Flip order Columns / Rows in a table

I'm using the epiR package as it does nice 2 by 2 contingency tables with odds ratios, and population attributable fractions.
As is common my data is coded
0 = No
1 = Yes
So when I do
table(var_1, var_2)
The output comes out as a table with the 0 (No) level first in both the rows and the columns.
For its input though epiR wants the top left square to be Exposed+VE Outcome+VE - i.e the top left square should be Var 1==1 and Var 2==1
Currently I do this by recoding the zeroes to 2 or alternatively by setting as a factor and using re-level. Both of these are slightly annoying for other analyses as in general I want Outcome+VE to come after Outcome-VE
So I wondered if there is an easy way (?within table) to flip the orientation of table so that it essentially inverts the ordering of the rows/columns?
Hope the above makes sense - happy to provide clarification if not.
Edit: Thanks for suggestions below; just for clarification I want to be able to do this when calling table from existing dataframe variable - i.e when what I am doing is table(data$var_1, data$var_2) - ideally without having to create a whole new object
Table is a simple matrix. You can just call indices in reverse order.
xy <- table(data.frame(value = rbinom(100, size = 1, prob = 0.5),
variable = letters[1:2]))
variable
value a b
0 20 22
1 30 28
xy[2:1, 2:1]
variable
value b a
1 28 30
0 22 20
Using factor levels:
# reproducible example (adapted from Roman's answer)
df1 <- data.frame(value = rbinom(100, size = 1, prob = 0.5),
variable = letters[1:2])
table(df1)
# variable
# value a b
# 0 32 23
# 1 18 27
#convert to factor, specify levels
df1$value <- factor(df1$value, levels = c("1", "0"))
df1$variable <- factor(df1$variable, levels = c("b", "a"))
table(df1)
# variable
# value b a
# 1 27 18
# 0 23 32
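To flip the table without recoding the underlying variables, the reversal can also be done on the fly. A small sketch with made-up 0/1 vectors (var_1 and var_2 here are illustrative, not the asker's data):

```r
# Hypothetical 0/1 coded variables
var_1 <- c(0, 0, 1, 1, 1, 0, 1, 0)
var_2 <- c(0, 1, 1, 0, 1, 0, 1, 1)

tab <- table(var_1, var_2)

# Reverse the row and column order, whatever the number of levels
flipped <- tab[rev(seq_len(nrow(tab))), rev(seq_len(ncol(tab)))]

# Or set the level order inside the call, leaving var_1 and var_2 untouched
flipped2 <- table(factor(var_1, levels = c(1, 0)),
                  factor(var_2, levels = c(1, 0)))
```

Either form puts the 1/1 cell in the top left corner while the original variables keep their usual 0-then-1 ordering for other analyses.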

Vectorize access to columns in R

I am trying to get all the indexes that meet a condition in a column. I've already done this for a single column like this:
# Get a 10% of samples labeled with a 1
indexPositive = sample(which(datafsign$result == 1), nrow(datafsign) * .1)
Is it possible to do the same operation, vectorized, for any number of columns in one line as well? I imagine that in that case indexPositive would be a list or array with the indexes for each column.
Data
The data frame is as follows:
x y f1 f2 f3 f4
1 76.71655 60.74299 1 1 -1 -1
2 -85.73743 -19.67202 1 1 1 -1
3 75.95698 -27.20154 1 1 1 -1
4 -82.57193 39.30717 1 1 1 -1
5 -45.32161 39.44898 1 1 -1 -1
6 -46.76636 -35.30635 1 1 1 -1
The seed I am using is set.seed(1000000007)
What I want is the set of indexes with value 1. In the case of only one column the result is:
head(indexPositive)
[1] 1398 873 3777 2140 133 3515
Thanks in advance.
Answer
Thanks to @David Arenburg I finally did it. Based on his comment I created this function:
getPercentageOfData <- function(x, condition = 1, percentage = .1){
# Get the percentage of samples that meet condition
#
# Args:
# x: A vector containing the data
# condition: Condition that the data need to satisfy
# percentage: What percentage of samples to get
#
# Returns:
# Indexes of the percentage of the samples that meet the condition
meetCondition = which(x == condition)
sample(meetCondition, length(meetCondition) * percentage)
}
And then I used it like this:
# Get a 10% of samples labeled with a 1 in all 4 functions
indexPositive = lapply(datafunctions[3:6], getPercentageOfData)
# Change 1 by -1
datafunctions$f1[indexPositive$f1] = -1
datafunctions$f2[indexPositive$f2] = -1
datafunctions$f3[indexPositive$f3] = -1
datafunctions$f4[indexPositive$f4] = -1
It would be great to also assign the values -1 to each column at once instead of writing 4 lines, but I do not know how.
You can define your function as follows (you can also add replacement as a parameter):
getPercentageOfData <- function(x, condition = 1, percentage = .1, replacement = -1){
meetCondition <- which(x == condition)
replace(x, sample(meetCondition, length(meetCondition) * percentage), replacement)
}
Then select the columns you want to operate on and update datafunctions directly (without creating indexPositive and then manually updating)
cols <- 3:6
datafunctions[cols] <- lapply(datafunctions[cols], getPercentageOfData)
You can of course play around with the function's parameters within lapply, for example:
datafunctions[cols] <- lapply(datafunctions[cols],
getPercentageOfData, percentage = .8, replacement = -100)
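One caveat worth keeping in mind: when sample() receives a single number n as its first argument, it samples from 1:n rather than returning n itself, so a column with exactly one matching row can give surprising indexes. A hedged variant that samples positions instead, shown on a made-up data frame (the function name replaceFraction and the toy data are assumptions for illustration):

```r
replaceFraction <- function(x, condition = 1, percentage = 0.1, replacement = -1) {
  hits <- which(x == condition)
  n <- floor(length(hits) * percentage)
  # sample.int() picks positions within `hits`, so a single hit is never
  # misread as the range 1:hit
  picked <- hits[sample.int(length(hits), n)]
  replace(x, picked, replacement)
}

set.seed(1)
# Hypothetical data: f1 is all 1s, f2 alternates between 1 and -1
toy <- data.frame(f1 = rep(1, 20), f2 = rep(c(1, -1), 10))

# toy[] keeps the data frame structure while replacing every column
toy[] <- lapply(toy, replaceFraction, percentage = 0.5)

colSums(toy == -1)
```

Which rows get flipped depends on the seed, but the number of replaced values per column is fixed by the percentage.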

Create a new data frame based on another dataframe

I am trying to use a huge dataframe (180000 x 400) to calculate another one that would be much smaller.
I have the following dataframe
df1=data.frame(LOCAT=c(1,2,3,4,5,6),START=c(120,345,765,1045,1347,1879),END=c(150,390,802,1120,1436,1935),CODE1=c(1,1,0,1,0,0),CODE2=c(1,0,0,0,-1,-1))
df1
LOCAT START END CODE1 CODE2
1 1 120 150 1 1
2 2 345 390 1 0
3 3 765 802 0 0
4 4 1045 1120 1 0
5 5 1347 1436 0 -1
6 6 1879 1935 0 -1
This is a sample dataframe. The rows continue until 180000 and the columns are over 400.
What I need to do is create a new dataframe for each column that tells me the size of each contiguous run of "1" or "-1" and returns it with the location, size and value.
Something like this for CODE1:
LOCAT SIZE VALUE
1 1 to 2 270 POS
2 4 to 4 75 POS
And like this for CODE2:
LOCAT SIZE VALUE
1 1 to 1 30 POS
2 5 to 6 588 NEG
Unfortunately I still didn't figure out how to do this. I have been trying several lines of code to develop a function to do this automatically but start to get lost or stuck in loops and it seems that nothing works.
Any help would be appreciated.
Thanks in advance
Below is code that gives you the answer in the exact format that you wanted, except I split your "LOCAT" column into two columns entitled "Starts" and "Stops". This code will work for your entire data frame, no need to replicate it manually for each CODE (CODE1, CODE2, etc).
It assumes that the only non-CODE columns are named "LOCAT", "START" and "END".
# need package "plyr"
library("plyr")
# test2 is the example data frame that you gave in the question
test2 <- data.frame(
"LOCAT"=1:6,
"START"=c(120,345,765, 1045, 1347, 1879),
"END"=c(150,390,802,1120,1436, 1935),
"CODE1"=c(1,1,0,1,0,0),
"CODE2"=c(1,0,0,0,-1,-1)
)
codeNames <- names(test2)[!names(test2)%in%c("LOCAT","START","END")] # the names of columns that correspond to different codes
test3 <- reshape(test2, varying=codeNames, direction="long", v.names="CodeValue", timevar="Code") # reshape so the different codes are variables grouped into the same column
test4 <- test3[,!names(test3)%in%"id"] #remove the "id" column
sss <- function(x){ # sss gives the starting points, stopping points, and sizes (sss) in a data frame
rleX <- rle(x[,"CodeValue"]) # rle() to get the size of consecutive values
stops <- cumsum(rleX$lengths) # cumulative sum to get the end-points for the indices (the second value in your LOCAT column)
starts <- c(1, head(stops,-1)+1) # the starts are the first value in your LOCAT column
ssX0 <- data.frame("Value"=rleX$values, "Starts"=starts, "Stops"=stops) #the starts and stops from X (ss from X)
ssX <- ssX0[ssX0[,"Value"]!=0,] # remove the rows that correspond to CODE_ values that are 0 (not POS or NEG)
# The next 3 lines calculate the equivalent of your SIZE column
sizeX1 <- x[ssX[,"Starts"],"START"]
sizeX2 <- x[ssX[,"Stops"],"END"]
sizeX <- sizeX2 - sizeX1
sssX <- data.frame(ssX, "Size"=sizeX) # Combine the Size to the ssX (start stop of X) data frame
return(sssX) #Added in EDIT
}
answer0 <- ddply(.data=test4, .variables="Code", .fun=sss) # use the function ddply() in the package "plyr" (apply the function to each CODE, why we reshaped)
answer <- answer0 # duplicate the original, new version will be reformatted
answer[,"Value"] <- c("NEG",NA,"POS")[answer0[,"Value"]+2] # reformat slightly so that we have POS/NEG instead of 1/-1
Hopefully this helps, good luck!
Use run-length encoding to determine groups where CODE1 takes the same value.
rle_of_CODE1 <- rle(df1$CODE1)
For convenience, find the points where the value is non-zero, and the lengths of the corresponding blocks.
CODE1_is_nonzero <- rle_of_CODE1$values != 0
n <- rle_of_CODE1$lengths[CODE1_is_nonzero]
Ignore the parts of df1 where CODE1 is zero.
df1_with_nonzero_CODE1 <- subset(df1, CODE1 != 0)
Define a group based on the contiguous blocks we found with rle.
df1_with_nonzero_CODE1$GROUP <- rep(seq_along(n), times = n)
Use ddply (from the plyr package, which must be loaded with library(plyr)) to get summary stats for each group.
summarised_by_CODE1 <- ddply(
df1_with_nonzero_CODE1,
.(GROUP),
summarise,
MinOfLOCAT = min(LOCAT),
MaxOfLOCAT = max(LOCAT),
SIZE = max(END) - min(START)
)
summarised_by_CODE1$VALUE <- ifelse(
rle_of_CODE1$values[CODE1_is_nonzero] == 1,
"POS",
"NEG"
)
summarised_by_CODE1
## GROUP MinOfLOCAT MaxOfLOCAT SIZE VALUE
## 1 1 1 2 270 POS
## 2 2 4 4 75 POS
Now repeat with CODE2.
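Rather than repeating the steps by hand for each column, the rle() idea can be wrapped in a function and applied to every CODE column. This is a base R sketch under the assumption that all code columns have names starting with "CODE"; the helper name runs_for and the output column names are made up for illustration:

```r
df1 <- data.frame(LOCAT = 1:6,
                  START = c(120, 345, 765, 1045, 1347, 1879),
                  END   = c(150, 390, 802, 1120, 1436, 1935),
                  CODE1 = c(1, 1, 0, 1, 0, 0),
                  CODE2 = c(1, 0, 0, 0, -1, -1))

runs_for <- function(df, col) {
  r <- rle(df[[col]])
  stops  <- cumsum(r$lengths)        # last row of each run
  starts <- stops - r$lengths + 1    # first row of each run
  keep   <- r$values != 0            # drop the runs of zeros
  data.frame(
    CODE  = rep(col, sum(keep)),
    FROM  = df$LOCAT[starts[keep]],
    TO    = df$LOCAT[stops[keep]],
    SIZE  = df$END[stops[keep]] - df$START[starts[keep]],
    VALUE = ifelse(r$values[keep] > 0, "POS", "NEG")
  )
}

code_cols <- grep("^CODE", names(df1), value = TRUE)
result <- do.call(rbind, lapply(code_cols, runs_for, df = df1))
result
```

On the sample data this gives sizes of 270 and 75 for CODE1 and 30 and 588 for CODE2, matching the output requested in the question.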

How to split a column into multiple integers of fixed size?

I have a column of integers that I would like to split into multiple separate integer vectors.
Creating a list of dataframes using split() doesn't work for my later purposes
df <- as.data.frame(runif(n = 10000, min = 1, max = 10))
Here split() creates a list of data frames, which I can't use later on; I need separate integer vectors that show up under "Values".
map.split <- split(df, (as.numeric(rownames(df)) - 1) %/% 250) # this is not the trick
My goal is to split the column into separate integer vectors (listed under "Values" rather than "Data" in the Global Environment).
This would be the slow way:
VecList1 <- df[1:250,]
VecList2 <- df[251:500,]
with
str(VecList1)
Int [1:250] 1 1 10 5 3 ....
Any advice welcome
If I'm interpreting correctly (not clear to me), here's a reduced problem and what I think you're asking for.
set.seed(2)
df <- data.frame(x = runif(10, min = 1, max = 10))
df$Values <- (seq_len(nrow(df))-1) %/% 4
df
# x Values
# 1 2.663940 0
# 2 7.321366 0
# 3 6.159937 0
# 4 2.512467 0
# 5 9.494554 1
# 6 9.491275 1
# 7 2.162431 1
# 8 8.501039 1
# 9 5.212167 2
# 10 5.949854 2
If all you need is that Values column as its own object, then you can just change df$Values <- ... to Values <- ....
Here's one way of doing this (although it's probably better to figure out a way where you don't need a series of separate vectors, but rather work with columns in a single matrix):
df <- data.frame(a=runif(n = 10000, min = 1, max = 10))
mx<-matrix(df$a,nrow=250)
for (i in 1:NCOL(mx)) {
assign(paste0("VecList",i),mx[,i])}
Note: using assign is generally not advisable. Whatever it is you're trying to achieve, there's probably a better way of doing it without creating a series of new vectors in the global environment.
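One possible alternative to assign() is to split the vector itself rather than the data frame, which gives a list of plain vectors instead of a list of data frames. A minimal sketch with a smaller made-up column (the names x and chunks are illustrative):

```r
set.seed(2)
x <- runif(1000, min = 1, max = 10)   # stand-in for the data frame column

# Integer division builds the group ids 0, 0, ..., 1, 1, ... in blocks of 250
chunks <- split(x, (seq_along(x) - 1) %/% 250)

length(chunks)     # number of blocks
str(chunks[[1]])   # each element is a plain numeric vector
```

Each block is then available as chunks[[1]], chunks[[2]], and so on, and lapply() or sapply() can process all of them without creating separate objects in the global environment.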

Resources