R: extract the last two numbers in a variable

I have two datasets (data1 and data2).
Data1 has many columns, one of which is named: B23333391
Data2 has a column called id_number, where id numbers are listed (e.g. 344444491)
I need to extract the last two digits (91) from the variable in data1 and match them with the last two digits of the id numbers in the id_number column of data2, since the last two digits identify an individual.
E.g.:
Data1:
columns: -> B23333391..... and so on
Data2:
columns: -> id_number
344444491
and so on....
How can this be done?
Thanks in advance!

Try this approach. You can use a dplyr pipeline to build an id variable in both data frames with substr(), using nchar() to find the positions of the last two characters. After that you can merge using left_join(). Here is the code, with simulated data similar to yours:
library(dplyr)
# Data
df1 <- data.frame(Var1 = c('B23333391'), Val1 = 1, stringsAsFactors = FALSE)
df2 <- data.frame(Varid = c('344444491'), Val2 = 1, stringsAsFactors = FALSE)
# Merge on the last two characters of each id
dfnew <- df1 %>%
  mutate(id = substr(Var1, nchar(Var1) - 1, nchar(Var1))) %>%
  left_join(df2 %>% mutate(id = substr(Varid, nchar(Varid) - 1, nchar(Varid))),
            by = "id")
Output:
       Var1 Val1 id     Varid Val2
1 B23333391    1 91 344444491    1
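The question's wording suggests that in data1 the ids may be column names rather than values. A minimal base R sketch for that case follows. It assumes the ids in data2 are character strings, and `last_two`, `lookup`, and the second example column `B23333372` are hypothetical names invented for illustration.

```r
# Assumption: data1 stores the ids as column names, data2 stores them as values
data1 <- data.frame(B23333391 = c(1.5, 2.5), B23333372 = c(3.5, 4.5))
data2 <- data.frame(id_number = c("344444491", "344444472"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: return the last two characters of each string
last_two <- function(x) {
  x <- as.character(x)
  substr(x, nchar(x) - 1, nchar(x))
}

# Lookup table with one row per column of data1
lookup <- data.frame(column = names(data1),
                     id = last_two(names(data1)),
                     stringsAsFactors = FALSE)

# Add the same two-character key to data2, then match on it
data2$id <- last_two(data2$id_number)
matched <- merge(data2, lookup, by = "id", all.x = TRUE)
```

Here `matched` pairs each id_number with the data1 column that shares its last two digits, and `all.x = TRUE` keeps id numbers that have no matching column.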

Related

How to add a column that identifies groups of consecutive days

In a data.frame, I would like to add a column that identifies groups of consecutive days.
I think I need to start by converting my strings to date format...
Here's my example :
mydf <- data.frame(
var_name = c(rep("toto",6),rep("titi",5)),
date_collection = c("09/12/2022","10/12/2022","13/12/2022","16/12/2022","16/12/2022","17/12/2022",
"01/12/2022","03/11/2022","04/11/2022","05/11/2022","08/11/2022")
)
Expected output :
Convert to Date class, take the difference between adjacent dates to create a logical vector (TRUE where the gap is more than one day), and take the cumulative sum:
library(dplyr)
library(lubridate)
mydf %>%
  mutate(id = cumsum(c(0, abs(diff(dmy(date_collection)))) > 1) + 1)
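The one-liner can be unpacked step by step. The sketch below uses base R only, with as.Date() and an explicit format in place of dmy(); `d` and `new_run` are names introduced here for illustration. Like the answer, it does not restart the numbering for each var_name.

```r
mydf <- data.frame(
  var_name = c(rep("toto", 6), rep("titi", 5)),
  date_collection = c("09/12/2022", "10/12/2022", "13/12/2022", "16/12/2022",
                      "16/12/2022", "17/12/2022", "01/12/2022", "03/11/2022",
                      "04/11/2022", "05/11/2022", "08/11/2022"),
  stringsAsFactors = FALSE
)

# Step 1: parse day/month/year strings into Date objects
d <- as.Date(mydf$date_collection, format = "%d/%m/%Y")

# Step 2: flag rows whose gap from the previous row exceeds one day
# (the first row has no previous row, so it is never flagged)
new_run <- c(FALSE, abs(as.numeric(diff(d))) > 1)

# Step 3: each flagged row starts a new group; cumsum() numbers the groups
mydf$id <- cumsum(new_run) + 1
```

Rows that are zero or one day apart from the previous row share an id, so the duplicated 16/12/2022 entries stay in the same group.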

How to subset the data frame based on selected variable with limited column?

My data frame has many columns, and I would like to subset it to selected rows and only a few columns.
My sample data:
df <- data.frame('ID'=c('A','B','C'),'YEAR'=c('2020','2020','2020'),'MONTH'=c('1','1','1'),'DAY'=c('16','16','16'),'HOUR'=c('15','15','15'),'VALUE1'=c(1,2,3))
I would like to keep only the row where ID == 'C' and the columns 'ID' and 'VALUE1'.
Expected output:
  ID VALUE1
1  C      3
Appreciate any help!
What I have tried so far:
df1 <- subset(df, df$ID == 'C')
df2 <- subset(df1, select = c('ID', 'VALUE1'))
Is there a more efficient way to do this? Creating several intermediate data frames is not ideal when there are many such subsets.
You can also chain dplyr functions:
library(dplyr)
df %>% select(ID, VALUE1) %>% filter(ID == "C")
We can pass both the subset and select arguments to subset() in a single call:
subset(df, subset = ID=='C', select = c('ID', 'VALUE1'))
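For comparison, plain bracket indexing does the same in one step without subset() or dplyr. This is a minimal sketch on the question's sample data; `res` is a name chosen here for illustration.

```r
df <- data.frame(ID = c("A", "B", "C"),
                 YEAR = c("2020", "2020", "2020"),
                 MONTH = c("1", "1", "1"),
                 DAY = c("16", "16", "16"),
                 HOUR = c("15", "15", "15"),
                 VALUE1 = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Rows go before the comma (logical condition), columns after it (names)
res <- df[df$ID == "C", c("ID", "VALUE1")]
```

Unlike the expected output shown in the question, the result keeps the original row name (3), since bracket indexing does not renumber rows.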

Splitting data frame in r into two

I have a data frame with a column 'Date2' that holds the day of the month from another column, 'Date', modulo 5. I want to split the data into two data frames: one containing all rows where the modulus is 0, and a second containing all others.
The code below works on this reproducible example. Since I have to apply it to a large dataset, I want to know whether it is an appropriate approach.
Here is my code:
library(dplyr)
library(lubridate)
DD <- seq(as.Date("2019/01/01"), by = "day", length.out = 31) # creating data for df
DD2 <- data.frame(Date = DD, var = 1:31) # reproducible df for testing
DD2 <- DD2 %>%
  mutate(Date2 = mday(Date) %% 5) # day of month modulo 5, stored in Date2
DD2
D3 <- split(DD2, DD2$Date2 == 0) # list of two data frames; element "TRUE" has remainder 0
D4 <- split(DD2, DD2$Date2 != 0) # list of two data frames; element "TRUE" has all other records
D3
D4
If we need to split, we can also use group_split():
DD2 %>%
  group_split(new = as.integer(!mday(Date) %% 5))
Or with split():
DD3 <- split(DD2, DD2$Date2 == 0)
and then extract the elements with [[:
DD3[["TRUE"]]
DD3[["FALSE"]]
However, the indicator could also be kept as a grouping variable instead of splitting into multiple datasets:
DD2 %>%
  mutate(grp = as.integer(!mday(Date) %% 5))
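A single split() call is enough to get both pieces, and it can be done in base R without lubridate. The sketch below assumes the same January 2019 example data; `day_of_month`, `is_multiple`, `parts`, `multiples`, and `others` are names introduced here for illustration.

```r
DD2 <- data.frame(
  Date = seq(as.Date("2019-01-01"), by = "day", length.out = 31),
  var = 1:31
)

# Day of the month as an integer, using format() instead of lubridate::mday()
day_of_month <- as.integer(format(DD2$Date, "%d"))

# TRUE for days divisible by 5 (5, 10, ..., 30)
is_multiple <- day_of_month %% 5 == 0

# One call returns a list with elements named "FALSE" and "TRUE"
parts <- split(DD2, is_multiple)
multiples <- parts[["TRUE"]]
others <- parts[["FALSE"]]
```

Computing the logical vector once and splitting once avoids building the same list twice, which matters more as the data grows.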

R combine / merge two columns to one column

I have two columns (class: character) in a data.frame that contain large numbers (e.g. column A: 999967258082532415; column B: 999967258082532415). I want a new column C that combines the two numbers: 999967258082532415999967258082532415
I use:
data_1$visit_id <- do.call(paste, c(data_1[c("post_visid_high", "post_visid_low")], sep = ""))
But my new column gets converted to a factor, and I want it to stay character. What can I do?
I created a sample dataset that resembles yours, keeping the ids as character strings because numbers this large cannot in general be stored exactly as numeric values:
df <- data.frame(col_A = c("2314325435454354", "123098213728903214", "12329042374094"),
                 col_B = c("9034832054097390485", "30945743504375043", "234903284304"),
                 stringsAsFactors = FALSE)
Using dplyr, create a new column (col_C) that concatenates the other two columns with paste0(), which returns a character vector:
library(dplyr)
df <- df %>%
  mutate(col_C = paste0(col_A, col_B))
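As a quick check in base R, the sketch below concatenates the question's two columns with paste0() and confirms the result is character. It also illustrates why such ids are better kept as strings: an 18-digit value does not survive a round trip through numeric. The one-row data frame is made up from the example value in the question, and `x` and `round_trip` are illustrative names.

```r
data_1 <- data.frame(post_visid_high = "999967258082532415",
                     post_visid_low = "999967258082532415",
                     stringsAsFactors = FALSE)

# paste0() concatenates element-wise and returns a character vector
data_1$visit_id <- paste0(data_1$post_visid_high, data_1$post_visid_low)
class(data_1$visit_id)  # "character"

# Converting an 18-digit id to numeric and back does not reproduce it
x <- "999967258082532415"
round_trip <- as.character(as.numeric(x))
identical(round_trip, x)  # FALSE
```

If the source columns arrive as factors, wrapping them in as.character() before paste0() is a common precaution.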

What is the elegant way to select n latest (by date) entries in data.frame in R?

I have the following data frame (just an example)
Date StudentID Gender Grade
The data frame is unbalanced in the sense that there are significantly more males than females. I need to select all females from the data frame, plus the same number of males, taking the males with the latest dates. The dates are stored as Date type. The data frame is unsorted, and multiple rows may have the same date.
What is the most elegant way to perform this task?
Supposing dat is your data frame and it is ordered by Date, you could use:
rbind(tail(dat[dat$Gender == "Male", ], 10),
      tail(dat[dat$Gender == "Female", ], 10))
or:
library(data.table)
setDT(dat)[, tail(.SD, 10) , by = Gender]
or:
library(dplyr)
dat %>% group_by(Gender) %>% do(tail(., 10))
Each will select the last 10 cases for both groups.
Here is how you can create a data frame for the males:
# subset all male records
df1 <- df[df$Gender == 'Male', ]
# sort by date in descending order (most recent first)
df2 <- df1[rev(order(df1$Date)),]
# retain same number of rows as number of females
df.male <- df2[1:sum(df$Gender == 'Female'), ]
To create a data frame for the females, you just need this:
df.female <- df[df$Gender == 'Female', ]
You can combine them using this:
df.all <- rbind(df.male, df.female)
Note that I assume that your Date column is already actually of class Date and not something else, like a factor or character. In the event that it is not a date, then you will have to convert it first in order to sort by the date.
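Putting the steps together, a minimal base R sketch on made-up data might look like the following. The eight rows, their grades, and the names `females`, `males`, `recent_males`, and `balanced` are all invented for illustration, and Date is assumed to already be of class Date.

```r
# Made-up example: 6 males, 2 females, unsorted dates
df <- data.frame(
  Date = as.Date("2020-01-01") + c(0, 5, 2, 9, 7, 3, 1, 8),
  StudentID = 1:8,
  Gender = c("Male", "Male", "Female", "Male", "Male", "Female", "Male", "Male"),
  Grade = c(70, 82, 91, 65, 77, 88, 59, 73),
  stringsAsFactors = FALSE
)

females <- df[df$Gender == "Female", ]
males <- df[df$Gender == "Male", ]

# Most recent males first
males <- males[order(males$Date, decreasing = TRUE), ]

# Keep as many males as there are females
recent_males <- head(males, nrow(females))

balanced <- rbind(females, recent_males)
```

Using head(males, n) instead of indexing with 1:n is a deliberate choice here: head() returns zero rows when n is 0, whereas 1:0 evaluates to c(1, 0) and would still return the first row.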
