Add a column on a dataframe - r

I have an R problem if you can help.
x <- data.frame("LocationCode" = c("ESC3","RIECAA6","SJHMAU","RIE104","SJH11","SJHAE","RIEAE1","WGH54","RIE205","GSBROB"), "HospitalNumber" = c("701190923R","2905451068","700547389X","AN11295201","1204541612","104010665","800565884R","620063158W","600029720K","1112391223"),"DisciplineName" = c("ESC Biochemistry", "RIE Haematology","SJH Biochemistry","RIE Biochemistry","SJH Biochemistry","WGH Biochemistry","ESC Biochemistry","WGH Biochemistry","SJH Biochemistry","RIE Haematology"))
From the dataframe above, I want to add a new column (CRN) containing every "HospitalNumber" value that consists of 9 digits followed by 1 letter (e.g. 701190923R), and another column (TIT) containing the remaining values, which do not meet the first criterion.

You can do this in base R using the following code:
# Identify cases which match 9 digits then one letter
CRMMatch <- grepl("^\\d{9}[[:alpha:]]$", as.character(x$HospitalNumber))
#Create columns from Hospital number among the matches or those that do not match
x$CRN[CRMMatch] <- as.character(x$HospitalNumber)[CRMMatch]
x$TIT[!CRMMatch] <- as.character(x$HospitalNumber)[!CRMMatch]
# clean up by removing the variable created of matches
rm(CRMMatch)
A dplyr version could be
library(dplyr)
x <-
x %>%
mutate(CRN = if_else(grepl("^\\d{9}[[:alpha:]]$", as.character(HospitalNumber)),as.character(HospitalNumber), NA_character_),
TIT = if_else(!grepl("^\\d{9}[[:alpha:]]$", as.character(HospitalNumber)),as.character(HospitalNumber), NA_character_))

You can detect what you need with the instruction
library(stringr)
str_which(x$HospitalNumber,"[:digit:][:alpha:]")
and you get:
> str_which(x$HospitalNumber,"[:digit:][:alpha:]")
[1] 1 3 7 8 9
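These are the row positions that match, so you know which rows you need and which you don't. Note that this pattern only looks for a digit followed by a letter somewhere in the string; it does not check for exactly 9 digits, so it is looser than the anchored pattern used in the other answers.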
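To make the difference between an unanchored and an anchored pattern concrete, here is a minimal sketch in base R. The sample values are hypothetical (the last one is not in the question's data) and were picked only to show where the two patterns disagree.

```r
# Hypothetical sample values; "12A3456789" is invented for illustration
ids <- c("701190923R", "2905451068", "AN11295201", "12A3456789")

# Unanchored: TRUE whenever any digit is directly followed by a letter
loose <- grepl("[[:digit:]][[:alpha:]]", ids)

# Anchored: TRUE only for exactly 9 digits followed by exactly 1 letter
strict <- grepl("^[[:digit:]]{9}[[:alpha:]]$", ids)

data.frame(ids, loose, strict)
```

On the question's own data both patterns happen to select the same rows, so the looser pattern is only a problem if other ID formats can appear.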

Quite similar to Kerry Jackson's approach but using ifelse in base R. I have also converted your x$HospitalNumber from factor to character from the start, assuming that this is what you really want:
x[2] <- as.character( x[ , 2 ] )
x$CRN <- ifelse( grepl( "^\\d{9}[[:alpha:]]$", x$HospitalNumber) , x$HospitalNumber, "" )
x$TIT <- ifelse( x$CRN != "", "", x$HospitalNumber )
gives you
> x
   LocationCode HospitalNumber   DisciplineName        CRN        TIT
1          ESC3     701190923R ESC Biochemistry 701190923R
2       RIECAA6     2905451068  RIE Haematology            2905451068
3        SJHMAU     700547389X SJH Biochemistry 700547389X
4        RIE104     AN11295201 RIE Biochemistry            AN11295201
5         SJH11     1204541612 SJH Biochemistry            1204541612
6         SJHAE      104010665 WGH Biochemistry             104010665
7        RIEAE1     800565884R ESC Biochemistry 800565884R
8         WGH54     620063158W WGH Biochemistry 620063158W
9        RIE205     600029720K SJH Biochemistry 600029720K
10       GSBROB     1112391223  RIE Haematology            1112391223

Related

How to split one column with multiple delimiters into multiple columns in R?

I have values with the following structure: string OR string_string.integer
Example:
df<-data.frame(Objs=c("Windows","Door_XYZ.1", "Door_XYY.1", "Chair_XYYU.2" ))
Objs
Windows
Door_XYZ.1
Door_XYY.1
Chair_XYYU.2
Using the command split(), separate() or something similar I need to generate a dataframe similar to this one:
Note: the split must be performed on the characters "_" and "."
Objs          IND    TAG   Control
Windows       NA     NA    NA
Door_XYZ.1    Door   XYZ   1
Door_XYY.1    Door   XYY   1
Chair_XYYU.2  Chair  XYYU  2
The closest solution was suggested by #Tommy in a similar context:
df %>% data.frame(.,do.call(rbind,str_split(.$Objs,"_")))
The default value of the sep argument in separate() will nearly get the result you need. A conditional mutate was also needed to remove the Windows entry from the IND column.
library(tidyverse)
df <- data.frame(Objs=c("Windows","Door_XYZ.1", "Door_XYY.1", "Chair_XYYU.2" ))
df %>%
separate(Objs, into = c("IND", "TAG", "Control"), remove = FALSE, fill = "right") %>%
mutate(IND = if_else(Objs == IND, NA_character_, IND))
#>           Objs   IND  TAG Control
#> 1      Windows  <NA> <NA>    <NA>
#> 2   Door_XYZ.1  Door  XYZ       1
#> 3   Door_XYY.1  Door  XYY       1
#> 4 Chair_XYYU.2 Chair XYYU       2
Created on 2022-05-05 by the reprex package (v1.0.0)
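For comparison, the same split can be sketched in base R with strsplit() and a character class covering both delimiters. This is only an illustration, not the answer's method; it assumes every value has either no delimiter or exactly the form string_string.integer, and the helper names (parts, rows, out) are made up here.

```r
objs <- c("Windows", "Door_XYZ.1", "Door_XYY.1", "Chair_XYYU.2")

# Split on either "_" or "."
parts <- strsplit(objs, "[_.]")

# Keep complete three-part splits; anything else becomes a row of NAs
rows <- lapply(parts, function(p) {
  if (length(p) == 3) p else rep(NA_character_, 3)
})

out <- data.frame(Objs = objs, do.call(rbind, rows), stringsAsFactors = FALSE)
names(out) <- c("Objs", "IND", "TAG", "Control")
out
```

Unlike separate(), this leaves Control as a character column; wrap it in as.integer() if a number is needed.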

How to do rowsums on a select set of columns containing a string and a number in R?

I have a list of column names that look like this...
colnames(dat)
1 subject
2 e.type
3 group
4 boxnum
5 edate
6 file.name
7 fr
8 active
9 inactive
10 reward
11 latency.to.first.active
12 latency.to.first.inactive
13 act0.600
14 act600.1200
15 act1200.1800
16 act1800.2400
17 act2400.3000
18 act3000.3600
19 inact0.600
20 inact600.1200
21 inact1200.1800
22 inact1800.2400
23 inact2400.3000
24 inact3000.3600
25 rew0.600
26 rew600.1200
27 rew1200.1800
28 rew1800.2400
29 rew2400.3000
30 rew3000.3600
I want to get the row sums for the columns named act#, inact#, and rew#.
This works...
for (row in 1:nrow(dat)) {
dat[row, "active"] = rowSums(dat[row,c(13:18)])
dat[row, "inactive"] = rowSums(dat[row,c(19:24)])
dat[row, "reward"] = rowSums(dat[row,c(25:30)])
}
But I don't want to hard-code it, since the number of columns in the 3 sections may change. How can I do this without hard-coding the column indexes?
Also, for example, I tried searching for the "act" named columns but it was also including the "active" column.
sub_dat <- dat[, 13:30]
result <- sapply(split.default(sub_dat, substr(names(sub_dat), 1, 3)), rowSums)
dat[, c('active', 'inactive', 'reward')] <- result
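A variant that avoids the column indexes altogether is to look the columns up by an anchored name pattern. The sketch below uses a small made-up data frame (two time bins instead of six) and a hypothetical lookup vector, prefixes, that maps each target column to the prefix of its source columns.

```r
# Toy data with the same naming scheme as the question (values invented)
dat <- data.frame(
  subject = 1:2, active = NA, inactive = NA, reward = NA,
  act0.600 = c(1, 2), act600.1200 = c(3, 4),
  inact0.600 = c(5, 6), inact600.1200 = c(7, 8),
  rew0.600 = c(9, 10), rew600.1200 = c(11, 12)
)

# Target column -> prefix of the columns to sum (assumed naming convention)
prefixes <- c(active = "act", inactive = "inact", reward = "rew")

for (target in names(prefixes)) {
  # "^" keeps "act" from matching "inact..."; "[0-9]" keeps it from matching "active"
  cols <- grep(paste0("^", prefixes[[target]], "[0-9]"), names(dat))
  dat[[target]] <- rowSums(dat[, cols, drop = FALSE])
}
dat[, c("active", "inactive", "reward")]
```

Because the columns are found by name, adding or removing time bins does not require changing the code.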
Easy-peasy with select() and matches() from the tidyverse. The patterns are anchored with ^ so that "act[0-9]" does not also pick up the inact columns.
library(tidyverse)
dat %>%
  mutate(
    sum_act = rowSums(select(., matches("^act[0-9]"))),
    sum_inact = rowSums(select(., matches("^inact[0-9]"))),
    sum_rew = rowSums(select(., matches("^rew[0-9]")))
  )
I made an example how it could be done:
t <- data.frame(c(1,2,3),c("a","b","c"))
colnames(t) <- c("num","char")
#with function append() you make a list of rows that fulfill your logical argument
whichRows <- append(which(t$char == "a"),which(t$char == "b"))
sum(t$num[whichRows])
or if I misunderstood you and you want to sum for every column separately then:
sum(t$num[which(t$char == "a")])
sum(t$num[which(t$char == "b")])

How to apply function with multiple conditions on multiple columns to get new conditional columns in R

Hello all, an R noob here.
I hope you guys can help me with the following.
I need to derive new columns from groups of existing columns in my dataset, and I need to do this repeatedly. For the first transformation I use columns 1, 2 and 3: if certain conditions are met, the new column gets a 1, otherwise a 0. For the second transformation I use columns 4, 5 and 6, and the output should again be a 1 or a 0. I have to do this 18 times. I already wrote a function which successfully does the transformation if I enter the variables manually, but I would like to apply this function to all the desired columns at once. My desired output would be 18 new columns with 0's and 1's. Finally, I will make a last column which displays a 1 if any of the 18 columns is a 1, and a 0 otherwise.
df <- data.frame(admiss1 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss2 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss3 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
visit1 = sample(seq(as.Date('1995/01/01'), as.Date('1996/01/01'), by="day"), 12),
visit2 = sample(seq(as.Date('1997/01/01'), as.Date('1998/01/01'), by="day"), 12),
reason1 = sample(3,12, replace = T),
reason2 = sample(3,12, replace = T),
reason3 = sample(3,12, replace = T))
df$discharge1 <- df$admiss1 + 10
df$discharge2 <- df$admiss2 + 10
df$discharge3 <- df$admiss3 + 10
#every discharge date is 10 days after the admission date for the sake of this example
#now I have the following dataframe
#for the sake of it I included only 3 dates and reasons(instead of 18)
admiss1 admiss2 admiss3 visit1 visit2 reason1 reason2 reason3 discharge1 discharge2 discharge3
1 1990-03-12 1992-04-04 1998-07-31 1995-01-24 1997-10-07 2 1 3 1990-03-22 1992-04-14 1998-08-10
2 1999-05-18 1990-11-25 1995-10-04 1995-03-06 1997-03-13 1 2 1 1999-05-28 1990-12-05 1995-10-14
3 1993-07-16 1998-06-10 1991-07-05 1995-11-06 1997-11-15 1 1 2 1993-07-26 1998-06-20 1991-07-15
4 1991-07-05 1992-06-17 1995-10-12 1995-05-14 1997-05-02 2 1 3 1991-07-15 1992-06-27 1995-10-22
5 1995-08-16 1999-03-08 1992-04-03 1995-02-20 1997-01-03 1 3 3 1995-08-26 1999-03-18 1992-04-13
6 1999-10-07 1991-12-26 1995-05-05 1995-10-24 1997-10-15 3 1 1 1999-10-17 1992-01-05 1995-05-15
7 1998-03-18 1992-04-18 1993-12-31 1995-11-14 1997-06-14 3 2 2 1998-03-28 1992-04-28 1994-01-10
8 1992-08-04 1991-09-16 1992-04-23 1995-05-29 1997-10-11 1 2 3 1992-08-14 1991-09-26 1992-05-03
9 1997-02-20 1990-02-12 1998-03-08 1995-10-09 1997-12-29 1 1 3 1997-03-02 1990-02-22 1998-03-18
10 1992-09-16 1997-06-16 1997-07-18 1995-12-11 1997-01-12 1 2 2 1992-09-26 1997-06-26 1997-07-28
11 1991-01-25 1998-04-07 1999-07-02 1995-12-27 1997-05-28 3 2 1 1991-02-04 1998-04-17 1999-07-12
12 1996-02-25 1993-03-30 1997-06-25 1995-09-07 1997-10-18 1 3 2 1996-03-06 1993-04-09 1997-07-05
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(dis))] <= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, 0)
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(admis))] <= df[eval(substitute(vis2))] & df[eval(substitute(dis))] >= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, xnew)
return(xnew)
}
I wrote this function to generate a 1 if the conditions are true and a 0 if the conditions are false.
-Condition 1: admission date and discharge date are between visit 1 and visit 2 + admission reason is 2.
-Condition 2: admission date is after visit 1 but before visit 2 and the discharge date is after visit 2 with also admission reason 2.
It should return 1 if these conditions are true and 0 if these conditions are false. Eventually, I will end up with 18 new variables with 1's or 0's and will combine them to make one variable with Admission between visit 1 and visit 2 (with reason 2).
If I enter the variable names manually it works, but I can't make it work for all the variables at once. I tried to make a string vector with all the admission dates, discharge dates and reasons and tried to transform them with mapply, but this does not work.
admiss <- paste0(rep("admiss", 3), 1:3)
discharge <- paste0(rep("discharge", 3), 1:3)
reason <- paste0(rep("reason", 3), 1:3)
visit1 <- rep("visit1",3)
visit2 <- rep("visit2",3)
mapply(admissdate, admis = admiss, dis = discharge, rsn = reason, vis1 = visit1, vis2 = visit2)
I have also considered lapply, but there you have to define an X = ..., which I think I cannot use because I have multiple columns that I want to pass in. Please correct me if I am wrong!
I also considered using a for loop, but I don't know how to use that with multiple conditions.
Any help would be greatly appreciated!
You can change the function to accept values instead of column names.
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
xnew <- ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
return(xnew)
}
Now create new columns -
admiss <- paste0("admiss", 1:3)
discharge <- paste0("discharge", 1:3)
reason <- paste0("reason", 1:3)
new_col <- paste0('newcol', 1:3)
df[new_col] <- Map(function(x, y, z) admissdate(x, y, z, df$visit1, df$visit2),
df[admiss],df[discharge],df[reason])
# The additional column is 1 if any of the new columns is 1.
df$result <- as.integer(rowSums(df[new_col]) > 0)
df
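Since the question's data is generated with sample(), the result changes on every run. To check the logic by hand, here is a small deterministic sketch of the same Map() pattern. The data frame toy and its dates are invented for illustration, and the function body is a condensed restatement of the two conditions rather than the answer's exact code.

```r
# Condensed restatement of the two conditions from the question
admissdate <- function(admis, dis, rsn, vis1, vis2) {
  inside   <- admis >= vis1 & dis <= vis2 & rsn == 2                  # condition 1
  overlaps <- admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2  # condition 2
  as.integer(inside | overlaps)
}

# Invented data: two patients, two admissions each
toy <- data.frame(
  admiss1 = as.Date(c("1996-01-01", "1990-01-01")),
  admiss2 = as.Date(c("1990-01-01", "1997-12-28")),
  reason1 = c(2, 2),
  reason2 = c(2, 2),
  visit1  = as.Date(c("1995-06-01", "1995-06-01")),
  visit2  = as.Date(c("1998-01-01", "1998-01-01"))
)
toy$discharge1 <- toy$admiss1 + 10
toy$discharge2 <- toy$admiss2 + 10

admiss    <- paste0("admiss", 1:2)
discharge <- paste0("discharge", 1:2)
reason    <- paste0("reason", 1:2)
new_col   <- paste0("newcol", 1:2)

# Map() walks the three column lists in parallel, one admission at a time
toy[new_col] <- Map(function(a, d, r) admissdate(a, d, r, toy$visit1, toy$visit2),
                    toy[admiss], toy[discharge], toy[reason])
toy$result <- as.integer(rowSums(toy[new_col]) > 0)
toy[c(new_col, "result")]
```

Patient 1 qualifies through the first admission (fully inside the visit window) and patient 2 through the second (admitted before visit 2, discharged after it), so both end up with result 1.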

If else statement with a value that is part of a continuous character in R

My dataframe (df) contains a list of values which are labelled in the format 'Month', 'Name of Site' and 'Camera No.'. For example, if my value is 'DECBUTCAM27' then DEC is December, BUT is the name of the site and CAM27 is the camera number.
I have 100 such values with 19 different site names.
I want to write if-else code such that only the site names are recognised and a corresponding number is added.
My initial idea was to add the corresponding number for all 100 values, but since nested ifelse does not work beyond 50 values I couldn't use that option.
This is what I had written for the option that I tried:
df <- df2 %>% mutate(Site_ID =
ifelse (CT_Name == 'DECBUTCAM27', "1",
ifelse (CT_Name == 'DECBUTCAM28', "1",
ifelse (CT_Name == 'DECI2NCAM01', "2",
ifelse (CT_Name == 'DECI2NCAM07', "2",
ifelse (CT_Name == 'DECI5CAM39', "3",
ifelse (CT_Name == 'DECI5CAM40', "3","NoVal")))))))
I am looking for code that recognises only the sites, i.e. 'BUT', 'I2N' and 'I5', and adds a corresponding number.
Any help would be greatly appreciated.
Extract the sitename using regex and use match + unique to assign unique number.
df2$site_name <- sub('...(.*)CAM.*', '\\1', df2$CT_Name)
df2$Site_ID <- match(df2$site_name, unique(df2$site_name))
For example:
CT_Name <- c('DECBUTCAM27', 'DECBUTCAM28', 'DECI2NCAM07', 'DECI2NCAM01',
'DECI5CAM39', 'DECI5CAM40')
site_name <- sub('...(.*)CAM.*', '\\1', CT_Name)
site_name
#[1] "BUT" "BUT" "I2N" "I2N" "I5" "I5"
Site_ID <- match(site_name, unique(site_name))
Site_ID
#[1] 1 1 2 2 3 3
Here is a tidyverse solution:
You haven't provided a reproducible example, but let's use the CT_Names that you have supplied to create a test dataframe:
data <- tribble(
~ CT_Name,
"DECBUTCAM27",
"DECBUTCAM28",
"DECI2NCAM01",
"DECI2NCAM07",
"DECI5CAM39",
"DECI5CAM40"
)
Let's assume that the string format is 3 letters for months, 2 or more letters or numbers for site and CAM + 1 or more digits for camera number (adjust these as needed). We can use a regular expression in tidyr's extract() function to split up the string into its components:
data_new <- data %>%
extract(CT_Name, regex = "(\\w{3})(\\w{2,})(CAM\\d+)", into = c("Month", "Site", "Camera"))
(add remove = FALSE if you want to keep the original CT_Name variable)
This yields:
# A tibble: 6 x 3
Month Site Camera
<chr> <chr> <chr>
1 DEC BUT CAM27
2 DEC BUT CAM28
3 DEC I2N CAM01
4 DEC I2N CAM07
5 DEC I5 CAM39
6 DEC I5 CAM40
We can then group by site and assign a group ID as your Site_ID:
data_new <- data %>%
extract(CT_Name, regex = "(\\w{3})(\\w{2,})(CAM\\d+)", into = c("Month", "Site", "Camera")) %>%
group_by(Site) %>%
mutate(Site_ID = cur_group_id())
This produces:
# A tibble: 6 x 4
# Groups: Site [3]
Month Site Camera Site_ID
<chr> <chr> <chr> <int>
1 DEC BUT CAM27 1
2 DEC BUT CAM28 1
3 DEC I2N CAM01 2
4 DEC I2N CAM07 2
5 DEC I5 CAM39 3
6 DEC I5 CAM40 3
Here is a quick example using regex to find the site code and using an apply function to return a vector of code.
df <- data.frame(code = c('DECBUTCAM27','JANBUTCAM27','DECDUCCAM45'))
df$loc <- apply(df, 1, function(x) gsub("CAM.*$","",gsub("^.{3}",'',x[1])))
unique(df$loc) # all the location of the file
df$n <- as.numeric(as.factor(df$loc)) # get a number for each location
Mind that I use x[1] here because the codes are in the first column of my data.frame, which may differ for you.
---EDIT--- This was a previous answer that also works, but with more work for you to do. However, it allows you to choose which numeric code (or text) is assigned to each location, which is useful if the locations are ordered, for example.
It requires you to list the code for every site, which I find heavy in terms of code, but it works. The switch part is roughly the same as an ifelse.
The regex strips the first 3 characters and everything from the 'CAM' sequence to the end.
df <- data.frame(code = c('DECBUTCAM27','JANBUTCAM27','DECDUCCAM45'))
df$n <- apply(df, 1, function(x) switch(gsub("CAM.*$","",gsub("^.{3}",'',x[1])),
BUT = 1,
DUC = 2)
)
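If you want to pick the numbers yourself but avoid apply() and switch(), a named vector can serve as the lookup table. This is a sketch of that alternative, not part of the original answer; site_ids is a hypothetical mapping that you would extend to all 19 sites.

```r
codes <- c("DECBUTCAM27", "JANBUTCAM27", "DECDUCCAM45")

# Drop the 3-letter month and everything from "CAM" onwards
site <- sub("^.{3}(.*)CAM.*$", "\\1", codes)

# Hypothetical site -> ID mapping, chosen by hand
site_ids <- c(BUT = 1, DUC = 2)

# Indexing by name is vectorised; unknown sites would come back as NA
n <- unname(site_ids[site])
n
```

Sites missing from site_ids yield NA, which makes gaps in the mapping easy to spot.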

How to create a shared householdID for spouses by referencing two separate character IDs

This question is similar to my previous question (How to create a "householdID" for rows with shared "customerID" and "spouseID"?), although this version deals with a rat's nest of mixed character and numeric strings instead of simply numeric IDs. I'm trying to create a "household ID" for all couples who appear in a larger dataframe. In short, each individual has a "customerID" and a "spouseID". If a customer is married, their spouse's ID appears in the "spouseID" column. If they are not married, the spouseID field is empty. Each member of a married couple appears on its own row, so a couple needs a common "householdID" that both rows share.
What is the best way to add a unique householdID that is duplicated for couples? A small and over-simplified example of the original data is as follows. Note that the original IDs are far more complex, with varying lengths and patterns of numbers and characters.
df <- data.frame(
prospectID=as.character(c("G1339jf", "6dhd54G1", "Cf14c", "Bvmkm1", "kda-1qati", "pwn9enr", "wj44v04t4t", "D15", "dkfs044nng", "v949s")),
spouseID=as.character( c( "", "wj44v04t4t", "", "pwn9enr", "", "Bvmkm1", "6dhd54G1", "", "v949s", "dkfs044nng")),
stringsAsFactors = FALSE)
> df
prospectID spouseID
1 G1339jf
2 6dhd54G1 wj44v04t4t
3 Cf14c
4 Bvmkm1 pwn9enr
5 kda-1qati
6 pwn9enr Bvmkm1
7 wj44v04t4t 6dhd54G1
8 D15
9 dkfs044nng v949s
10 v949s dkfs044nng
An example of my desired result is as follows:
> df
   prospectID   spouseID HouseholdID
1     G1339jf                      1
2    6dhd54G1 wj44v04t4t           2
3       Cf14c                      3
4      Bvmkm1    pwn9enr           4
5   kda-1qati                      5
6     pwn9enr     Bvmkm1           4
7  wj44v04t4t   6dhd54G1           2
8         D15                      6
9  dkfs044nng      v949s           7
10      v949s dkfs044nng           7
This is an edited solution due to comments made by OP.
Illustrative data:
df <- data.frame(
prospectID=as.character(c("A1jljljljl344asbvc", "A2&%$ll##fffh", "B1665453sskn:;", "B2gavQWEΩΩø⁄", "C1", "D1", "E1#+'&%", "E255646321", "F1", "G1")),
spouseID=as.character(c("A2&%$ll##fffh", "A1jljljljl344asöbvc", "B2gavQWEΩΩø⁄", "B1665453sskn:;_", "", "", "E255646321", "E1#+'&%", "", "")),
stringsAsFactors = FALSE)
First define a pattern to match:
patt <- paste(df$prospectID, df$spouseID, sep = "|")
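Second, define a for loop; here, a little manual editing is necessary for the first and the last value. Note that the loop compares each row with the next one, so it assumes the two spouses sit on adjacent rows. Maybe others can improve on this part: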
for(i in 1:nrow(df)){
df$HousholdID[1] <- 1
df$HousholdID[i] <- ifelse(grepl(patt[i], df$prospectID[i+1]), 1, 0)
df$HousholdID[10] <- 1
}
The final step is to run cumsum:
df$HousholdID <- cumsum(df$HousholdID)
The result:
df
           prospectID            spouseID HousholdID
1  A1jljljljl344asbvc       A2&%$ll##fffh          1
2       A2&%$ll##fffh A1jljljljl344asöbvc          1
3      B1665453sskn:;        B2gavQWEΩΩø⁄          2
4        B2gavQWEΩΩø⁄     B1665453sskn:;_          2
5                  C1                              3
6                  D1                              4
7             E1#+'&%          E255646321          5
8          E255646321             E1#+'&%          5
9                  F1                              6
10                 G1                              7
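When spouses are not on adjacent rows, as in the question's data, one order-independent option is to build a key from each pair of IDs sorted alphabetically and then number the distinct keys. The following is a minimal sketch of that idea, not the method above; it assumes the two spouses reference each other consistently and that no ID contains the "|" separator used for the key.

```r
# Data from the question
df <- data.frame(
  prospectID = c("G1339jf", "6dhd54G1", "Cf14c", "Bvmkm1", "kda-1qati",
                 "pwn9enr", "wj44v04t4t", "D15", "dkfs044nng", "v949s"),
  spouseID   = c("", "wj44v04t4t", "", "pwn9enr", "",
                 "Bvmkm1", "6dhd54G1", "", "v949s", "dkfs044nng"),
  stringsAsFactors = FALSE
)

# Unmarried people are paired with themselves so they get their own key
partner <- ifelse(df$spouseID == "", df$prospectID, df$spouseID)

# pmin/pmax put the two IDs in a fixed order, so both spouses share one key
key <- paste(pmin(df$prospectID, partner), pmax(df$prospectID, partner), sep = "|")

# Number the keys in order of first appearance
df$HouseholdID <- match(key, unique(key))
df
```

The IDs are compared as whole strings rather than as regular expressions, so special characters inside them do not matter.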
