R: How do I split a variable into named intervals?

I am working on a rather large dataset in R, containing a continuous numeric variable. In another dataset, I have named intervals, described by min and max values, that I want to apply to the continuous variable in my large dataset.
Below is some example code:
df<-data.frame(x=c(1:6))
groups<-data.frame(cat=c("a","b","c","d"), min=c(1,2,4,6), max=c(2,4,5,8))
I want to make a new column, df$cat, that assigns each value of df$x the category whose min-max boundaries (from the groups data frame) contain it.
Ideally, I would like the intervals to satisfy groups$min <= df$x < groups$max.
> df
x cat
1 1 a
2 2 b
3 3 b
4 4 c
5 5 d
6 6 d
Is there any easy way of doing this?

Set up data:
df <- data.frame(x=c(1:6))
groups <- data.frame(cat=c("a","b","c","d"), min=c(1,2,4,6), max=c(2,4,5,8))
Use cut() with the labels argument specified:
brks <- c(groups$min,tail(groups$max,1))
df$cat <- cut(df$x,breaks=brks,labels=groups$cat,right=FALSE)
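Note that the cut() call above builds its breaks from groups$min plus the last max, so it treats the intervals as contiguous (x = 5 lands in "c", because the third bin runs from 4 up to 6). If the intervals in groups can have gaps, a minimal sketch that checks min <= x < max row by row might look like the following. The name idx and the stringsAsFactors = FALSE argument are my additions, not part of the original answer.

```r
df <- data.frame(x = 1:6)
groups <- data.frame(cat = c("a", "b", "c", "d"),
                     min = c(1, 2, 4, 6),
                     max = c(2, 4, 5, 8),
                     stringsAsFactors = FALSE)

# For each x, take the index of the first interval with min <= x < max.
# which(...)[1] is NA when no interval matches.
idx <- sapply(df$x, function(v) which(groups$min <= v & v < groups$max)[1])

# Indexing with NA yields NA, so values that fall in a gap get no category.
df$cat <- groups$cat[idx]
df
```

With the example data, x = 5 falls in the gap between 5 and 6 and therefore gets NA instead of being absorbed into a neighbouring group. This loops over df$x, so for a very large dataset the cut() approach will be faster when the intervals really are contiguous.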

df <- data.frame(x = 1:6)
groups <- data.frame(cat = c("a","b","c","d"), min = c(1,2,4,6), max = c(2,4,5,8))
for (i in seq_len(nrow(groups))) {
  # TRUE where x lies within the closed interval [min, max] of group i
  in_range <- df$x >= groups$min[i] & df$x <= groups$max[i]
  # add one logical column per group, named after the group
  df[[as.character(groups$cat[i])]] <- in_range
}
Something like this will tell you which numbers fall in which group's range. Is this what you were after?

Related

Filtering/subsetting R dataframe based on each rows n'th position value

I have a 'df' with 2 columns:
Combinations <- c(0011111111, 0011113111, 0013113112, 0022223114)
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)
I am trying to find a way to subset or filter the dataframe where the 7th, 8th, and 9th digits of the 'Combinations' column equal 311. For the example given, I would expect Combinations 0011113111, 0013113112, and 0022223114.
There are also instances where I would need to find different combinations, in different nth positions.
I know substring() can find these values for single rows but I'm not sure how to apply it to an entire dataframe.
substring() works on vectors as well.
subset(df, substring(Combinations, 7, 9) == "311")
# Combinations Values
#2 0011113111 2
#3 0013113112 3
#4 0022223114 4
data
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
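Since the question also mentions needing different digit combinations at different positions, the substring() idea can be wrapped in a small helper. This is only a sketch: filter_by_digits and its argument names are hypothetical, and it assumes the column is stored as character so that leading zeros are preserved.

```r
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1, 2, 3, 4)
df <- data.frame(Combinations, Values, stringsAsFactors = FALSE)

# Hypothetical helper: keep the rows whose `column` contains `pattern`
# starting at character position `start`.
filter_by_digits <- function(data, column, start, pattern) {
  stop <- start + nchar(pattern) - 1
  keep <- substring(data[[column]], start, stop) == pattern
  data[keep, , drop = FALSE]
}

filter_by_digits(df, "Combinations", 7, "311")  # digits 7 to 9 equal "311"
filter_by_digits(df, "Combinations", 3, "22")   # digits 3 to 4 equal "22"
```

The pattern is passed as a string so that the comparison is character against character and does not rely on implicit coercion.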
Another base R idea:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
df[grep(pattern = "^[0-9]{6}311.$", df$Combinations), ]
Output:
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
As a tip, if you want to know more about regular expressions, this website helps me a lot: https://regexr.com/3elkd
Would this work?
library(dplyr)
library(stringr)
df %>% filter(str_sub(Combinations, 7, 9) == "311")
Combinations Values
1 0011113111 2
2 0013113112 3
3 0022223114 4
Not pretty but works:
df[which(lapply(strsplit(df$Combinations, ""), function(x) which(x[7]==3 & x[8]==1 & x[9]==1))==1),]
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
Data:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)

How to get maximum value for a list in a data frame in R

I am trying to create a new column that gets me the maximum value for a list in a data frame. I was wondering how I can create this column called maxvalue from the df$value column i.e., I would like to get the max of that list in the column.
x <- c( "000010011100011111001111111100", "011110", "0000000")
y<- c(1, 2,3)
df<- data.frame(x,y)
library(stringr)
df$value <- strsplit(df$x, "[^1]+", perl=TRUE)
# expected output ( I have tried the following)
df$maxvalue<- max(df$value)
df$maxvalue
8
4
0
This should do the trick:
df$value <- lapply(lapply(strsplit(as.character(df$x),"[^1]+"), nchar),max)
output:
> df
x y value
1 000010011100011111001111111100 1 8
2 011110 2 4
3 0000000 3 0
Simplified version of @Daniel O's logic:
df$value <- sapply(strsplit(as.character(df$x),"[^1]+"), function(x){max(nchar(x))})
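We can also use charToRaw with rle. The runs are compared as raw bytes, and max(0, ...) guards against strings with no 1s, which would otherwise give -Inf with a warning:
sapply(as.character(df$x), function(x)
  with(rle(charToRaw(x)), max(0, lengths[values == charToRaw("1")])))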
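One thing to be aware of: the lapply() version stores the result as a list column, while sapply() simplifies it to a vector. If you want a guaranteed integer vector, a sketch with vapply() could look like this (longest_run is just an illustrative name; it assumes every string is non-empty):

```r
x <- c("000010011100011111001111111100", "011110", "0000000")
y <- c(1, 2, 3)
df <- data.frame(x, y, stringsAsFactors = FALSE)

# Split each string on runs of non-1 characters, so only the runs of 1s
# (plus possibly an empty string) remain, then take the longest piece.
# vapply() fails loudly if any result is not a single integer.
longest_run <- vapply(strsplit(df$x, "[^1]+"),
                      function(pieces) max(nchar(pieces)),
                      integer(1))

df$maxvalue <- longest_run
df$maxvalue
```

A string made up only of zeros splits into a single empty piece, so its longest run comes out as 0.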

How to create a subset by using another subset as condition?

I want to create a subset using another subset as a condition. I can't show my actual data, but I can show an example that deals with the core of my problem.
For example, I have 10 subjects with 10 observations each. So an example of my data would be to create a simple data frame using this:
ID <- rep(1:10, each = 10)
x <- rnorm(100)
y <- rnorm(100)
df <- data.frame(ID,x,y)
Which creates:
ID x y
1 1 0.08146318 0.26682668
2 1 -0.18236757 -1.01868755
3 1 -0.96322876 0.09565239
4 1 -0.64841436 0.09202456
5 1 -1.15244873 -0.38668929
6 1 0.28748521 -0.80816416
7 1 -0.64243912 0.69403155
8 1 0.84882350 -1.48618271
9 1 -1.56619331 -1.30379070
10 1 -0.29069417 1.47436411
11 2 -0.77974847 1.25704185
12 2 -1.54139896 1.25146126
13 2 -0.76082748 0.22607239
14 2 -0.07839719 1.94448322
15 2 -1.53020374 -2.08779769
etc.
Some of these subjects were positive for an event (for example subject 3, 5 and 7), so I have created a subset for that using:
event_pos <- subset(df, ID %in% c("3","5","7"))
Now, I also want to create a subset for the subjects who were negative for an event. I could use something like this:
event_neg <- subset(df, ID %in% c("1","2","4","6","8","9","10"))
The problem is, my data set is too large to specify all the individuals of the negative group. Is there a way to use my subset event_pos to get all the subjects with negative events in one subset?
TL;DR
Can I get a subset_2 by removing the subset_1 from the data frame?
You can use:
ind_list <- c("3","5","7")
event_neg <- subset(df, (ID %in% ind_list) == FALSE)
or
event_neg <- subset(df, !(ID %in% ind_list))
Hope that helps.
Gottaviannoni
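If you would rather not keep a separate list of IDs, the same negation should also work directly against the IDs already present in event_pos. A minimal sketch with the example data (set.seed is only there to make the random numbers reproducible):

```r
set.seed(1)
df <- data.frame(ID = rep(1:10, each = 10), x = rnorm(100), y = rnorm(100))

# Subjects that were positive for the event
event_pos <- subset(df, ID %in% c(3, 5, 7))

# Everything whose ID does not occur in event_pos
event_neg <- subset(df, !(ID %in% event_pos$ID))

unique(event_neg$ID)
```

Because the condition is evaluated on ID, all rows of a subject move together, so the two subsets split the original data frame without overlap.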

Using sum(x:y) to create a new variable/vector from existing values in R

I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If numberofevents == 3, it is supposed to calculate sum(1:3), equal to 3 + 2 + 1 = 6.
If numberofevents == 2, it is supposed to calculate sum(1:2), equal to 2 + 1 = 3.
Working with a large set of data, I thought it might be convenient to create this additional vector
by using the sum function in R d$z<-sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
Numerical expression has x elements: only the first is used.
You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]
Try using the sapply or lapply functions in R, which apply the sum to each element in turn:
d$z <- sapply(d$numberofevents, function(x) sum(1:x))
It works for me.
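The warning appears because the `:` operator only uses the first element of a vector on either side, so 1:d$numberofevents is evaluated as 1:3 for every row. Since sum(1:n) equals n * (n + 1) / 2, a fully vectorized sketch that avoids the loop could be:

```r
ID <- c("A", "A", "A", "B", "B")
eventcounter <- c(1, 2, 3, 1, 2)
numberofevents <- c(3, 3, 3, 2, 2)
d <- data.frame(ID, eventcounter, numberofevents)

# Closed form for 1 + 2 + ... + n, applied to the whole column at once
d$z <- d$numberofevents * (d$numberofevents + 1) / 2
d$z
```

This relies on numberofevents being a non-negative whole number in every row.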

Replicate variable based off match of two other variables in R

I've got a seemingly simple question that I can't answer: I've got three vectors:
x <- c(1,2,3,4)
weight <- c(5,6,7,8)
y <- c(1,1,1,2,2,2)
I want to create a new vector that replicates the values of weight for each time an element in x matches y such that it produces the following new weight vector associated with y:
y_weight <- c(5,5,5,6,6,6)
Any thoughts on how to do this (either loop or vectorized)? Thanks
You want the match function.
match(y, x)
returns the indices of the matches; then use that to build your new weight vector:
weight[match(y, x)]
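To make the two steps concrete, here is a short sketch with the vectors from the question. The extra value 9 is my own addition, to show what happens when an element of y has no match in x:

```r
x <- c(1, 2, 3, 4)
weight <- c(5, 6, 7, 8)
y <- c(1, 1, 1, 2, 2, 2)

# Position in x of each element of y
idx <- match(y, x)

# Use those positions to look up the weights
y_weight <- weight[idx]
y_weight

# Values of y that do not occur in x give NA rather than an error
unmatched <- weight[match(c(1, 9), x)]
unmatched
```

match() returns only the first matching position, so this assumes the values in x are unique.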
#Using plyr
library(plyr)
df<-as.data.frame(cbind(x,weight)) # converting to dataframe
df<-rename(df,c(x="y")) # rename x as y for joining dataframes
y<-as.data.frame(y) # converting to dataframe
mydata <- join(df, y, by = "y",type="right")
> mydata
y weight
1 1 5
2 1 5
3 1 5
4 2 6
5 2 6
6 2 6
