I have a dataset where different values can only be classified by the occurrence of the digit 1. All values consist of 5 digits. Now I need to create a new variable that groups the values. My question is whether there is a way, similar to Excel, to use placeholders (wildcards) in order to identify those values that start with 1.
What I have done so far is:
w$r <- ifelse(w$f == 1****, 1, 0)
Here, I wanted to filter out all values where 1 is the first digit.
It is noteworthy that some values have a recurring 1, i.e. a 1 in 2 positions.
All values either contain a 1 or consist only of zeros.
Examples of the data are 00000, 00001, 11100 etc. The goal is to create one variable for each position at which a 1 can occur. E.g. a 1 as the first digit should set variable 1, and a value where the 1 occurs as the first and third digit needs to be accounted for in both variable 1 and variable 3.
EDIT:
Not quite sure whether that's what you want, but here's a try:
Data:
Since you also seem to have data with leading zeros you need to convert them to character:
df <- data.frame(w = c("00000", "00001", "11100", "10010", "11000", "10000", "10100", "00100", "10001"))
Solution:
# variable for "1" in first position:
df$r1 <- ifelse(grepl("^1", df$w), 1, 0)
# variable for "1" in second position:
df$r2 <- ifelse(grepl("^\\d1", df$w), 1, 0)
# variable for "1" in third position:
df$r3 <- ifelse(grepl("^\\d{2}1", df$w), 1, 0)
# variable for "1" in fourth position:
df$r4 <- ifelse(grepl("^\\d{3}1", df$w), 1, 0)
# variable for "1" in fifth position:
df$r5 <- ifelse(grepl("^\\d{4}1", df$w), 1, 0)
Result:
df
      w r1 r2 r3 r4 r5
1 00000  0  0  0  0  0
2 00001  0  0  0  0  1
3 11100  1  1  1  0  0
4 10010  1  0  0  1  0
5 11000  1  1  0  0  0
6 10000  1  0  0  0  0
7 10100  1  0  1  0  0
8 00100  0  0  1  0  0
9 10001  1  0  0  0  1
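A more compact variant, offered as a sketch rather than a drop-in replacement: split every string into its single characters and compare the whole character matrix against "1" in one step. This assumes that all values are character strings of the same length; the names chars, flags and result are made up for this example.

```r
w <- c("00000", "00001", "11100", "10010", "11000",
       "10000", "10100", "00100", "10001")

# One row per value, one column per digit position
chars <- do.call(rbind, strsplit(w, ""))

# TRUE/FALSE matrix turned into 1/0 (multiplying keeps the matrix shape)
flags <- (chars == "1") * 1L
colnames(flags) <- paste0("r", seq_len(ncol(flags)))

result <- cbind(data.frame(w = w), flags)
result
```

The number of indicator columns follows the string length, so the same lines would also work for values with more or fewer than 5 digits.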
Related
Does anyone have an idea how to generate a column where exactly one randomly chosen row is marked with the number 1? All other rows should be 0.
I need a function for this in R.
df <- data.frame(subject = 1, choice = 0, price75 = c(0,0,0,1,1,1,0,1))
This command will update the choice column to contain a single random row with value of 1 each time it is called. All other rows values in the choice column are set to 0.
df$choice <- +(seq_along(df$choice) == sample(nrow(df), 1))
integer(nrow(DF)) creates a vector of zeros, and [<- sets a 1 at the position drawn by sample(nrow(DF), 1L).
DF <- data.frame(subject=1, choice="", price75=c(0,0,0,1,1,1,0,1))
DF$choice <- `[<-`(integer(nrow(DF)), sample(nrow(DF), 1L), 1L)
DF
# subject choice price75
#1 1 0 0
#2 1 0 0
#3 1 0 0
#4 1 1 1
#5 1 0 1
#6 1 0 1
#7 1 0 0
#8 1 0 1
> x <- rep(0, 10)
> x[sample(1:10, 1)] <- 1
> x
[1] 0 0 0 0 0 0 0 1 0 0
Many ways to set a random value in a row\column in R
df<-data.frame(x=rep(0,10)) #make dataframe df, with column x, filled with 10 zeros.
set.seed(2022) #set a random seed - this is for repeatability
#two base methods for sampling:
#sample.int(n=10, size=1) # sample an integer from 1 to 10, sample size of 1
#sample(x=1:10, size=1) # sample from 1 to 10, sample size of 1
df$x[sample.int(n=10, size=1)] <- 1 # randomly selecting one of the ten rows, and replacing the value with 1
df
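Since the question asks for a function, the approaches above could be wrapped as follows. This is only a sketch; the name random_indicator is made up here and is not part of any package.

```r
# Return a 0/1 vector of length n with exactly one randomly placed 1
random_indicator <- function(n) {
  stopifnot(length(n) == 1, n >= 1)
  out <- integer(n)               # n zeros
  out[sample.int(n, 1L)] <- 1L    # one random position becomes 1
  out
}

df <- data.frame(subject = 1, choice = 0, price75 = c(0, 0, 0, 1, 1, 1, 0, 1))
set.seed(2022)                    # only for repeatability
df$choice <- random_indicator(nrow(df))
df
```

Each call draws a new position, so calling it again without resetting the seed will usually mark a different row.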
I have a data frame with a column that contains different codes (ICD-10). The codes consist of 4 alphanumeric characters. I want to search for specific codes based on just the first two characters. For example, if this is the column
codes = c("s001", "s1234", "s4g6", "T002", "T191","t985","s761","t17.5")
and I want to flag all rows whose code starts with S0, S1, T0, T9 or T1, assigning 1 to them and 0 otherwise. I have previously used %like% with case_when. However, I would like to know if there is a more efficient way to do this in R. Thanks
Use grepl() to test against a regular expression; it returns TRUE for any string that starts with s0, s1, T0, T1 or T9 and FALSE otherwise. Then use ifelse() to turn that vector of TRUEs and FALSEs into 1 for the TRUEs and 0 otherwise.
codes <- c("s001", "s1234", "s4g6", "T002", "T191","t985","s761","t17.5")
ifelse(grepl("^s[01]|^T[019]", codes), 1, 0)
Output:
[1] 1 1 0 1 1 0 0 0
Can also do:
as.numeric(grepl("^s[01]|^T[019]", codes))
We can use
+(grepl("^s[01]|^T[019]", codes))
[1] 1 1 0 1 1 0 0 0
We could define a pattern you want to detect and then use str_detect and assign 1 to TRUE and 0 to FALSE:
library(dplyr)
library(stringr)
# your dataframe with codes column
df <- data.frame(codes = c("s001", "s1234", "s4g6",
                           "T002", "T191", "t985",
                           "s761", "t17.5"))
# define what you want to search for
# (anchored to the start of the string; the match is case-sensitive)
search_pattern <- "^(S0|S1|T0|T9|T1)"
# check with `str_detect`
df %>%
  mutate(check = ifelse(str_detect(codes, search_pattern), 1, 0))
Output:
  codes check
1  s001     0
2 s1234     0
3  s4g6     0
4  T002     1
5  T191     1
6  t985     0
7  s761     0
8 t17.5     0
Another option with grepl
> +grepl("^([sT][01]|T9)", codes)
[1] 1 1 0 1 1 0 0 0
You can also use the substring approach. Extract only first 2 characters from the codes using substr and compare it against the correct_values.
codes = c("s001", "s1234", "s4g6", "T002", "T191","t985","s761","t17.5")
correct_values <- c("s0", "s1", "T0", "T9", "T1")
as.integer(substr(codes, 1, 2) %in% correct_values)
#[1] 1 1 0 1 1 0 0 0
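All answers above are case-sensitive, which is why "t985" and "t17.5" come out as 0. If upper and lower case should be treated alike (an assumption, since the question lists the prefixes in upper case while the data is mixed), a sketch could look like this:

```r
codes <- c("s001", "s1234", "s4g6", "T002", "T191", "t985", "s761", "t17.5")

# Regular expression, ignoring case
flag_regex <- as.integer(grepl("^(s[01]|t[019])", codes, ignore.case = TRUE))

# Same idea with the substring approach: upper-case the first two characters
wanted <- c("S0", "S1", "T0", "T9", "T1")
flag_substr <- as.integer(toupper(substr(codes, 1, 2)) %in% wanted)

flag_regex
# [1] 1 1 0 1 1 1 0 1
```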
I have a data frame that contains 100,000 rows. It looks like this:
Value
1
2
-1
-2
0
3
4
-1
3
I want to create an extra column B that consists of 0s and 1s.
It is 0 by default, but once 5 consecutive data points are all positive or all negative, it should be 1. This only holds while the run continues: if the run is positive and a negative number appears, the count starts again.
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but they didn't work. I also tried converting the whole data frame to a list and looping over the list, unfortunately without success.
Here's an approach that uses the rollmean function from the zoo package.
library(zoo)
set.seed(1000)
df <- data.frame(Value = sample(-9:9, 1000, replace = TRUE))
sgn <- sign(df$Value)  # -1, 0 or 1
rolling <- rollmean(sgn, k = 5, fill = 0, align = "right")
df$B <- as.numeric(abs(rolling) == 1)
I generated 1000 values with positive and negative sets.
Extract the sign of the values - this will be -1 for negative, 1 for positive and 0 for 0
Calculate the right aligned rolling mean of 5 values (it will average x[1:5], x[2:6], ...). This will be 1 or -1 if all the values in a row are positive or negative (respectively)
Take the absolute value and store the comparison against 1. This is a logical vector that turns into 0s and 1s based on your conditions.
Note - there's no need for loops. This can all be vectorised (once we have the rolling mean calculated).
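If installing zoo is not an option, the same rolling-window idea can be sketched in base R with stats::filter, summing the signs over the last 5 values instead of averaging them. The variable names here are made up for the example, and the data are the 9 values from the expected output in the question.

```r
Value <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)
sgn <- sign(Value)

# One-sided moving sum over the current and the 4 previous values;
# the first 4 positions have no full window and come back as NA
window_sum <- as.numeric(stats::filter(sgn, rep(1, 5), sides = 1))

# A sum of +5 or -5 means all five signs in the window agree
B <- as.integer(!is.na(window_sum) & abs(window_sum) == 5)
B
# [1] 0 0 0 0 1 1 1 0 0
```

Summing rather than averaging keeps the comparison on whole numbers, so no floating-point tolerance is needed.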
This will work. Not the most efficient way to do it but the logic is pretty transparent -- just check if there's only one unique sign (i.e. +, -, or 0) for each sequence of five adjacent rows:
dat <- data.frame(Value=c(1,2,1,2,2,3,4,-1,3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)){
  if (length(unique(sign(dat$Value[(x-4):x])))==1){
    dat$new_col[x] <- 1
  } else {
    dat$new_col[x] <- 0
  }
}
Use the cumsum(...diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group (d is the data frame from the question, with columns Value and B).
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)), FUN = function(x){
as.integer(seq_along(x) > 4)})
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
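The same run-based logic can also be written with rle() and sequence(), shown here as a sketch. Treating a run of zeros as "neither positive nor negative" is an assumption based on the wording of the question; drop the last condition if runs of zeros should count as well.

```r
Value <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)

runs <- rle(sign(Value))                       # lengths and signs of consecutive runs
pos_in_run <- sequence(runs$lengths)           # 1, 2, 3, ... restarting with every run
run_sign <- rep(runs$values, runs$lengths)     # sign of the run each value belongs to

# 1 from the 5th element of a positive or negative run onwards
B <- as.integer(pos_in_run >= 5 & run_sign != 0)
B
# [1] 0 0 0 0 1 1 1 0 0
```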
I need to check whether the number of elements of each unique value in the variable PPT in A is equal to the number of elements of each unique value in PPT in B, and whether there is any value unique only to A or only to B.
For example:
PPTa <- c("ppt0100109","ppt0301104","ppt0100109","ppt0100109","ppt0300249","ppt0100109","ppt0300249","ppt0100109","ppt0504409","ppt2303401","ppt0704210","ppt0704210","ppt0100109")
CNa <- c(110,54,110,110,49,10,49,110,409,40,10,10,110)
LLa <- c(150,55,150,150,45,15,45,115,405,45,5,15,50)
A <-data.frame(PPTa,CNa,LLa)
PPTb <- c("ppt0100200","ppt0300249","ppt0100109","ppt0300249","ppt0100109","ppt0764091","ppt2303401","ppt0704210","ppt0704210","ppt0100109")
CNb <- c(110,54,110,110,49,10,49,110,409,40)
LLb <- c(150,55,150,150,45,15,45,115,405,45)
B <-data.frame(PPTb,CNb,LLb)
In this case, we have these unique values which occur a certain amount of times:
A$PPTa TIMES
"ppt0100109" 6
"ppt0301104" 1
"ppt0300249" 2
"ppt0504409" 1
"ppt2303401" 1
"ppt0704210" 2
B$PPTb TIMES
"ppt0100200" 1
"ppt0300249" 2
"ppt0100109" 3
"ppt0764091" 1
"ppt2303401" 1
"ppt0704210" 2
I would like to create a new matrix (or anything you could suggest) with a value of 0 if the unique value exists both in A and B with the same number of elements, a value of 1 if it exists in both dataframes A and B but the number of elements differ, and a value of 2 if the value exists only in one of the two dataframes.
Something like:
A$PPTa TIMES OUTPUT
"ppt0100109" 6 1
"ppt0301104" 1 2
"ppt0300249" 2 0
"ppt0504409" 1 2
"ppt2303401" 1 0
"ppt0704210" 2 0
B$PPTb TIMES OUTPUT
"ppt0100200" 1 2
"ppt0300249" 2 0
"ppt0100109" 3 1
"ppt0764091" 1 2
"ppt2303401" 1 0
"ppt0704210" 2 0
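You can use a nested ifelse statement. Judging by the six values in the output, A and B here are the count tables (one row per unique value with its number of occurrences, sorted by value), not the raw data frames: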
ifelse(do.call(paste0, A) %in% do.call(paste0, B), 0, ifelse(A$PPTa %in% B$PPTb, 1, 2))
#[1] 1 0 2 2 0 0
ifelse(do.call(paste0, B) %in% do.call(paste0, A), 0, ifelse(B$PPTb %in% A$PPTa, 1, 2))
#[1] 1 2 0 0 2 0
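A sketch that starts from the raw PPT vectors, builds the counts with table() and then applies the 0/1/2 rule. The helper name classify is made up for this example, and table() sorts the values alphabetically, so the order differs from the order in the question.

```r
PPTa <- c("ppt0100109","ppt0301104","ppt0100109","ppt0100109","ppt0300249",
          "ppt0100109","ppt0300249","ppt0100109","ppt0504409","ppt2303401",
          "ppt0704210","ppt0704210","ppt0100109")
PPTb <- c("ppt0100200","ppt0300249","ppt0100109","ppt0300249","ppt0100109",
          "ppt0764091","ppt2303401","ppt0704210","ppt0704210","ppt0100109")

count_a <- table(PPTa)
count_b <- table(PPTb)

# 0: same count in both, 1: in both but counts differ, 2: only in `own`
classify <- function(own, other) {
  idx <- match(names(own), names(other))   # NA if the value is absent from `other`
  other_n <- as.vector(other)[idx]
  out <- ifelse(is.na(other_n), 2L,
                ifelse(as.vector(own) == other_n, 0L, 1L))
  setNames(out, names(own))
}

out_a <- classify(count_a, count_b)
out_b <- classify(count_b, count_a)
out_a
```

Because the result is a named vector, each code can be looked up directly, e.g. out_a["ppt0100109"].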
I would like to fill a data frame ("DF") with 0's or 1's depending on whether values in a vector ("Date") match date values in a second data frame ("df$Date").
If they match, the output value has to be 1, otherwise 0.
I tried to adjust this code written by a friend of mine, but it doesn't work:
for(j in 1:length(Date)) { #Date is a vector with all dates from 1967 to 2006
  # Start count
  count <- 0
  # Check all Dates between 1967-2006
  if(any(Date[j] == df$Date)) { #df$Date contains specific dates of interest
    count <- count + 1
  }
  # If there is a match between Date and df$Date, its output is 1, else 0.
  DF[j,i] <- count
}
The main data frame "DF" has 190 columns, which have to be filled, and a number of rows equal to the length of the Date vector.
extra info
1) Each column is different from the other ones and therefore the observations in a row cannot be all equal (i.e. in a single row, I should have a mixture between 0's and 1's).
2) The column names in "DF" are also present in "df" as df$Code.
We can vectorize this operation with %in% and as.integer(), leveraging the fact that coercing logical to integer returns 0 for FALSE and 1 for TRUE:
DF[,i] <- as.integer(Date%in%df$Date);
If you want to fill every single column of DF with the exact same result vector:
DF[] <- as.integer(Date%in%df$Date);
My above code exactly reproduces the logic of the code in your (original) question.
From your edit, I'm not 100% sure I understand the requirement, but my best guess is this:
set.seed(4L);
LV <- 10L; ND <- 10L;
Date <- sample(seq_len(ND),LV,T);
df <- data.frame(Date=sample(seq_len(ND),3L),Code=c('V1','V2','V3'));
DF <- data.frame(V1=rep(NA,LV),V2=rep(NA,LV),V3=rep(NA,LV));
Date;
## [1] 6 1 3 3 9 3 8 10 10 1
df;
## Date Code
## 1 8 V1
## 2 3 V2
## 3 1 V3
for (cn in colnames(DF)) DF[,cn] <- as.integer(Date%in%df$Date[df$Code==cn]);
DF;
## V1 V2 V3
## 1 0 0 0
## 2 0 0 1
## 3 0 1 0
## 4 0 1 0
## 5 0 0 0
## 6 0 1 0
## 7 1 0 0
## 8 0 0 0
## 9 0 0 0
## 10 0 0 1
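The per-column loop can also be expressed with sapply, which builds all columns in one call. This is a sketch under the same reading of the requirement as above; it uses the Date and df values printed above directly instead of re-drawing them, since sample() may return different numbers in other R versions.

```r
Date <- c(6, 1, 3, 3, 9, 3, 8, 10, 10, 1)
df <- data.frame(Date = c(8, 3, 1), Code = c("V1", "V2", "V3"))

codes <- as.character(unique(df$Code))

# One 0/1 column per code: 1 where Date matches a date listed for that code
DF <- as.data.frame(sapply(codes, function(cn) {
  as.integer(Date %in% df$Date[df$Code == cn])
}))
DF
```

sapply names the columns after the codes, so with 190 codes in df$Code this would yield the 190 columns without naming them by hand.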