Order a list of strings in R

The data I have include two variables: id and income (a list of characters)
id <- seq(1,6)
income <- c("2322;5125",
"0110;2012",
"2212;0912",
"1012;0145",
"1545;1102",
"1010;2028")
df <- data.frame(id, income)
df$income <- as.character(df$income)
I need to add a third column, income_order, which contains the values of column income sorted within each string.
NOTE: I still need to keep the leading zeros.

We could split the string on ";", sort and paste the string back.
df$income_order <- sapply(strsplit(df$income, ";"), function(x)
paste(sort(x), collapse = ";"))
df
# id income income_order
#1 1 2322;5125 2322;5125
#2 2 0110;2012 0110;2012
#3 3 2212;0912 0912;2212
#4 4 1012;0145 0145;1012
#5 5 1545;1102 1102;1545
#6 6 1010;2028 1010;2028
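Note that sort() on a character vector orders the pieces as text, which agrees with numeric order here only because every piece has the same width. A minimal sketch, assuming the pieces could have different lengths, that orders by numeric value while leaving the original strings (and their leading zeros) untouched. The helper name sort_pieces and the third sample value are hypothetical, not from the question:

```r
# Hypothetical helper: sort the ";"-separated pieces of each string
# by numeric value, but paste back the original text of each piece.
sort_pieces <- function(income, sep = ";") {
  vapply(strsplit(income, sep, fixed = TRUE), function(x) {
    paste(x[order(as.numeric(x))], collapse = sep)
  }, character(1))
}

income <- c("2212;0912", "1012;0145", "1000;900")
sort_pieces(income)
# [1] "0912;2212" "0145;1012" "900;1000"
```

With a plain sort(), the last element would stay "1000;900", because "1" sorts before "9" as text.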

Alternatively, gsubfn can capture the two numbers and paste them back in sorted order:
library(gsubfn)
df$income_order <- gsubfn("(\\d+);(\\d+)", ~ paste(sort(c(x, y)), collapse=";"), df$income)
df$income_order
#[1] "2322;5125" "0110;2012" "0912;2212" "0145;1012" "1102;1545" "1010;2028"

Related

How to add missing zeros in a unique identifier that is missing some values using R?

I have a unique ID that should contain 13 characters in total (15 including the dashes). It should look like this:
2005-067-000043
However some entries might be like this
2005-067-00043 or 2005-67-000043 or 2005-067-0000043
I would like a script that enforces three characters between the first and second dash: if there are more, cut leading zeros; if there are fewer, add leading zeros. The same goes for the last section, which should have six characters after the last dash: add leading zeros if there are fewer, cut leading zeros if there are more.
You can split the data into 3 columns, keep only the last 3 and 6 characters of the 2nd and 3rd columns (padding with zeros where needed), and combine the columns into one again.
library(dplyr)
library(tidyr)
separate(df, x, paste0('col', 1:3), sep = '-') %>%
mutate(col2 = sprintf('%03s', substring(col2, nchar(col2) - 2)),
col3 = sprintf('%06s', substring(col3, nchar(col3) - 5))) %>%
unite(result, starts_with('col'), sep = '-')
# result
#1 2005-067-000043
#2 2005-067-000043
#3 2005-067-000043
#4 2005-067-000043
x <- c('2005-067-000043', '2005-067-00043', '2005-67-000043', '2005-067-0000043')
df <- data.frame(x)
df
# x
#1 2005-067-000043
#2 2005-067-00043
#3 2005-67-000043
#4 2005-067-0000043
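One caveat: according to R's sprintf documentation, the 0 flag with %s zero-pads on some platforms and is ignored on others, so '%03s' may produce spaces instead of zeros. A minimal base R sketch that avoids this by converting each section to an integer first. It assumes every section contains only digits and that the numeric value fits in the target width (unlike the substring approach, it does not cut non-zero digits). The name pad_id is hypothetical:

```r
x <- c("2005-067-000043", "2005-067-00043", "2005-67-000043", "2005-067-0000043")

# Hypothetical helper: rebuild each ID with fixed-width middle and last sections.
pad_id <- function(id) {
  parts <- strsplit(id, "-", fixed = TRUE)
  vapply(parts, function(p) {
    # as.integer() drops surplus leading zeros; %03d and %06d add missing ones
    sprintf("%s-%03d-%06d", p[1], as.integer(p[2]), as.integer(p[3]))
  }, character(1))
}

pad_id(x)
# [1] "2005-067-000043" "2005-067-000043" "2005-067-000043" "2005-067-000043"
```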

Filter rows where any values in a vector are contained in a column

I have a dataset with a single column that contains multiple ICD-10 codes separated by spaces, e.g.
Identifier Codes
1 A14 R17
2 R069 D136 B08
3 C11 K71 V91
I have a vector with the ICD-10 codes that are relevant to my analysis, e.g. goodcodes<-c("C11","A14","R17","O80"). I want to select rows from my dataset where the Codes column contains any of the codes in my vector; the column as a whole does not need to exactly match a code in my vector.
Using medicalinfo<-filter(medicalinfo, Codes %in% goodcodes) returns only rows where a single matching code is listed in the Codes column. I could also filter based on a partial string, but I only know how to do that for a single partial string, not for all of those in my codes vector.
Is there a way to get all the rows where any of these codes are present in the column?
One trick is to combine the goodcodes into a regular expression:
library(dplyr)
ptn <- paste0("\\b(", paste(goodcodes, collapse = "|"), ")\\b")
ptn
# [1] "\\b(C11|A14|R17|O80)\\b"
FYI, the \\b( and )\\b are absolutely necessary if there's a chance that you will have codes such as A10 and A101; without \\b(...)\\b, grepl("A10", "A101") gives a false positive. See:
grepl("A10|B20", "A101")
# [1] TRUE
grepl("\\b(A10|B20)\\b", "A101")
# [1] FALSE
Finally, let's use that ptn:
dat %>%
filter(grepl(ptn, Codes))
# Identifier Codes
# 1 1 A14 R17
# 2 3 C11 K71 V91
Another way is to split the Codes column into a list of individual codes, and look for membership with %in%:
sapply(strsplit(trimws(dat$Codes), "\\s+"), function(a) any(a %in% goodcodes))
# [1] TRUE FALSE TRUE
Depending on how complex things are, a third way is to "unnest" Codes and look for matches.
dat %>%
mutate(Codes = strsplit(trimws(Codes), "\\s+")) %>%
tidyr::unnest(Codes) %>%
group_by(Identifier) %>%
filter(any(Codes %in% goodcodes)) %>%
ungroup()
# # A tibble: 5 x 2
# Identifier Codes
# <dbl> <chr>
# 1 1 A14
# 2 1 R17
# 3 3 C11
# 4 3 K71
# 5 3 V91
(If you really prefer them combined into a single space-delimited string as before, that's easy enough to do with group_by(Identifier) %>% summarize(Codes = paste(Codes, collapse = " ")). I don't recommend it, per se, since I prefer to have that type of information broken out like this, but there is likely context I don't know.)
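Another option is subset from base R: loop over the goodcodes vector, use each element as the pattern in grepl, and Reduce the resulting list of logical vectors into a single logical vector to subset the rows. (These patterns have no \\b word boundaries, so a code such as A14 would also match A141.)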
subset(dat, Reduce(`|`, lapply(goodcodes, function(x) grepl(x, Codes))))
# Identifier Codes
#1 1 A14 R17
#3 3 C11 K71 V91
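The Reduce approach can be combined with word boundaries so that a short code does not match inside a longer one. A minimal sketch in base R; the fourth row ("A141 B08") is a hypothetical addition to the sample data, included only to show the difference:

```r
dat <- data.frame(Identifier = 1:4,
                  Codes = c("A14 R17", "R069 D136 B08", "C11 K71 V91", "A141 B08"),
                  stringsAsFactors = FALSE)
goodcodes <- c("C11", "A14", "R17", "O80")

# Plain patterns: "A14" also matches inside "A141"
plain <- Reduce(`|`, lapply(goodcodes, function(x) grepl(x, dat$Codes)))

# Patterns wrapped in \\b: only whole codes match
bounded <- Reduce(`|`, lapply(goodcodes, function(x) {
  grepl(paste0("\\b", x, "\\b"), dat$Codes)
}))

plain
# [1]  TRUE FALSE  TRUE  TRUE
bounded
# [1]  TRUE FALSE  TRUE FALSE
dat[bounded, ]
```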
data
dat <- structure(list(Identifier = 1:3, Codes = c("A14 R17", "R069 D136 B08",
"C11 K71 V91")), class = "data.frame", row.names = c(NA, -3L))

Concatenate columns in data frame

We have brand data in a column (variable) that is delimited by semicolons (;). Our task is to split this column into multiple columns, which we were able to do with the following syntax.
The data was attached as a screenshot.
Here is the R code:
x<-dataset$Pref_All
point<-df %>% separate(x, c("Pref_01","Pref_02","Pref_03","Pref_04","Pref_05"), ";")
point[is.na(point)] <- ""
However, our question is this: we have this type of brand data in 10 to 15 columns, and with the above syntax the number of columns to split into depends on the maximum number of brands each column holds (which we counted manually and set to 5).
Is there a way to write the code dynamically, so that it calculates the maximum number of brands each column holds and creates that many new columns in a data frame? For example:
Pref_01,Pref_02,Pref_03,Pref_04,Pref_05.
The preferred output was given as a screenshot.
Thanks in advance for the help.
x <- c("Swift;Baleno;Ciaz;Scross;Brezza", "Baleno;swift;celerio;ignis", "Scross;Baleno;celerio;brezza", "", "Ciaz;Scross;Brezza")
strsplit(x,";")
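library(dplyr)
library(tidyr)
x <- data.frame(ID = c(1,2,3,4,5),
                Pref_All = c("S;B;C;S;B",
                             "B;S;C;I",
                             "S;B;C;B",
                             " ",
                             "C;S;B"))
# Make sure Pref_All is character (it is a factor by default before R 4.0)
x$Pref_All <- as.character(x$Pref_All)
# b was not defined in the original answer; it is assumed here to be
# the number of brands in each row
b <- lengths(strsplit(x$Pref_All, ";"))
final_df <- x %>%
  tidyr::separate(Pref_All, paste0("Pref_0", 1:max(b)), ";", fill = "right")
# Reuse the ID column to carry the original Pref_All string
final_df$ID <- x$Pref_All
final_df <- rename(final_df, Pref_All = ID)
final_df[is.na(final_df)] <- ""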
Pref_All Pref_01 Pref_02 Pref_03 Pref_04 Pref_05
1 S;B;C;S;B S B C S B
2 B;S;C;I B S C I
3 S;B;C;B S B C B
4
5 C;S;B C S B
The trick for the column names is given by paste0 going from 1 to the maximum number of brands in your data!
I would use str_split() from stringr, which returns a list of character vectors. From that, we can work out the maximum number of preferences in the data frame and then pad each element of the list with the missing entries.
library(stringr)
df=data.frame("id"=1:5,
"Pref_All"=c("brand1", "brand1;brand2;brand3", "", "brand2;brand4", "brand5"),
stringsAsFactors=FALSE)
spl = str_split(df$Pref_All, ";")
# Find the max number of preferences
maxl = max(unlist(lapply(spl, length)))
# Add missing values to each element of the list
spl = lapply(spl, function(x){c(x, rep("", maxl-length(x)))})
# Bind each element of the list in a data.frame
dfr = data.frame(do.call(rbind, spl))
# Rename the columns
names(dfr) = paste0("Pref_", 1:maxl)
print(dfr)
# Pref_1 Pref_2 Pref_3
#1 brand1
#2 brand1 brand2 brand3
#3
#4 brand2 brand4
#5 brand5
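Since the question mentions 10 to 15 such columns, the same idea can be wrapped in a function and applied to each of them. A minimal base R sketch, assuming every source column is named like Pref_All with an "_All" suffix; the helper name split_to_matrix and the Own_All column are hypothetical:

```r
# Hypothetical helper: split one delimited column into a character matrix
# with as many columns as the longest entry, padding short rows with "".
split_to_matrix <- function(v, prefix, sep = ";") {
  parts <- strsplit(as.character(v), sep, fixed = TRUE)
  n_max <- max(lengths(parts), 1L)
  out <- do.call(rbind, lapply(parts, function(p) c(p, rep("", n_max - length(p)))))
  colnames(out) <- sprintf("%s_%02d", prefix, seq_len(n_max))
  out
}

df <- data.frame(ID = 1:3,
                 Pref_All = c("Swift;Baleno;Ciaz", "", "Ciaz;Scross"),
                 Own_All = c("Ignis", "Swift;Brezza", ""),
                 stringsAsFactors = FALSE)

cols <- c("Pref_All", "Own_All")
pieces <- lapply(cols, function(nm) split_to_matrix(df[[nm]], sub("_All$", "", nm)))
result <- cbind(df["ID"], do.call(cbind, pieces))
names(result)
# [1] "ID"      "Pref_01" "Pref_02" "Pref_03" "Own_01"  "Own_02"
```

Each column gets its own maximum, so nothing has to be counted by hand.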

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that has been converted from a document-term matrix to a data frame. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that differentiates the same word coming from different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my data frame), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns so that a specific string is appended to all column names in the data frame. Say, for mtcars, I want all of the column names to have "_sample" at the end, so that mpg, cyl, and disp become mpg_sample, cyl_sample, disp_sample, and so on.
I'm considering using sapply or lapply, but I haven't made any progress with them. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
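Applied to the examples in the question, the same idiom could look like the sketch below. The data frames pos and neg are hypothetical stand-ins for the two document-term data frames:

```r
# Suffix every column of mtcars (working on a copy)
df <- mtcars
colnames(df) <- paste0(colnames(df), "_sample")
head(colnames(df), 3)
# [1] "mpg_sample"  "cyl_sample"  "disp_sample"

# Hypothetical term-count data frames for positive and negative comments
pos <- data.frame(hello = c(1, 0), great = c(2, 1))
neg <- data.frame(hello = c(0, 1), awful = c(1, 3))

# Tag each set of columns before combining, so "hello" stays distinguishable
colnames(pos) <- paste0("positive_", colnames(pos))
colnames(neg) <- paste0("negative_", colnames(neg))
combined <- cbind(pos, neg)
colnames(combined)
# [1] "positive_hello" "positive_great" "negative_hello" "negative_awful"
```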
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to prepend the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it renames by reference, so it doesn't create a copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4

Create long character strings in name levels

I'd like to modify level names according to a rule, but I have the problem below.
My data (initially the df variable was of class matrix; I changed it):
df <- data.frame(x = c("P27C", "P31B", "P12E", "P3E", "P7A", "P7D", "P2A", "P7D",
"P34", "P10C"),
y = rnorm(10), stringsAsFactors = F)
s<-c("P27CvsP31B","P27CvsP3C","P27CvsP3E","P27CvsP6B","P27CvsP7A","P27CvsP7C",
"P27DvsP27E","P27DvsP2B","P27DvsP31A","P27DvsP31B","P27DvsP3D","P27DvsP7D",
"P27EvsP2A","P27EvsP2B","P27EvsP2E","P27EvsP2F","P27EvsP2G","P27EvsP34",
"P7AvsP7H","P7BvsP7D","P7CvsP7G","P7DvsP7E","P7DvsP7F","P7DvsP7G","P7DvsP7H")
df
df$z <- lapply(df$x, grep, s, value = T)
# gives you the matches but empty slots for a missing value like "P12E"
df
for (r in 1:nrow(df)) {
if (length(df$z[[r]]) == 0) {
df$z[[r]] <- df$x[[r]]
}
else {
df$z[[r]] <- df$z[[r]]
}
}
# restores the original name of unmatched values
df$z
# Renamed, but the result is in list format!
My desired output is:
x y z
1 P27C 2.22354499 "P27CvsP31B, P27CvsP3C, P27CvsP3E, P27CvsP6B, P27CvsP7A, P27CvsP7C"
2 P31B 0.89197064 "P27CvsP31B, P27DvsP31B"
3 P12E -0.02313754 "P12E"
4 P3E 0.69916446 "P27CvsP3E"
5 P7A -0.44895512 "P27CvsP7A, P7AvsP7H"
6 P7D 1.77619979 "P27DvsP7D, P7BvsP7D, P7DvsP7E, P7DvsP7F, P7DvsP7G, P7DvsP7H"
7 P2A -0.18261732 "P27EvsP2A"
8 P7D 0.12025524 "P27DvsP7D, P7BvsP7D, P7DvsP7E, P7DvsP7F, P7DvsP7G, P7DvsP7H"
9 P34 -0.13434265 "P27EvsP34"
10 P10C 0.19971201 "P10C"
Thanks
This looks a bit ugly with the nested sapply. The inner sapply loops over the x column of your df and matches each entry against your vector s, creating a list of the matched results. The outer sapply loops over that list and pastes the entries of each element together. If there is no match, the result is an empty string, which we handle by substituting the df$x entry in its place.
df$z <- sapply(sapply(df$x, function(i) s[grepl(i, s)]), paste, collapse = ',')
df$z[df$z == ''] <- df$x[df$z == '']
df
# x y z
#1 P27C -0.95290496 P27CvsP31B,P27CvsP3C,P27CvsP3E,P27CvsP6B,P27CvsP7A,P27CvsP7C
#2 P31B 1.62237939 P27CvsP31B,P27DvsP31B
#3 P12E 2.60014202 P12E
#4 P3E 0.13964851 P27CvsP3E
#5 P7A -1.35071967 P27CvsP7A,P7AvsP7H
#6 P7D 0.79893102 P27DvsP7D,P7BvsP7D,P7DvsP7E,P7DvsP7F,P7DvsP7G,P7DvsP7H
#7 P2A -1.55499584 P27EvsP2A
#8 P7D 0.46372006 P27DvsP7D,P7BvsP7D,P7DvsP7E,P7DvsP7F,P7DvsP7G,P7DvsP7H
#9 P34 0.05242956 P27EvsP34
#10 P10C -0.20203180 P10C
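A variant of the same idea, as a minimal sketch: vapply guarantees a character vector, toString produces the ", " separator shown in the desired output, and matching is done on whole names rather than substrings. It assumes every entry of s has the form "AvsB" and that no individual name contains "vs"; the shortened x and s below are a subset of the question's data:

```r
x <- c("P27C", "P31B", "P12E", "P7D")
s <- c("P27CvsP31B", "P27CvsP3C", "P27DvsP31B", "P27DvsP7D", "P7DvsP7E")

# Split each comparison into its two sides once, up front
sides <- strsplit(s, "vs", fixed = TRUE)

z <- vapply(x, function(id) {
  # TRUE where id is exactly one of the two sides
  hit <- vapply(sides, function(p) id %in% p, logical(1))
  # Fall back to the original name when nothing matches
  if (any(hit)) toString(s[hit]) else id
}, character(1), USE.NAMES = FALSE)

z
# [1] "P27CvsP31B, P27CvsP3C"  "P27CvsP31B, P27DvsP31B" "P12E"
# [4] "P27DvsP7D, P7DvsP7E"
```

Exact matching avoids a name such as P3 picking up comparisons that involve P31B.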
EDIT
Based on @akrun's suggestion, an option with data.table would be:
library(data.table)
setDT(df)[, z := unlist(lapply(x, function(y) toString(grep(y, s, value = TRUE))))][z=="", z := x][]
