Formatting strings in a character vector of a data frame

Suppose I have a data frame (let's call it DF) that looks like this:
options(stringsAsFactors = F)
letters <- c("A", "B", "C", "D", "E")
value <- c(.44, .54, .21, .102, .002)
test <- c("2", "c(1,4)", "1", "3:4", "c(1,2)")
DF <- data.frame(cbind(letters, value, test))
DF$value <- as.numeric(DF$value)
This is what DF looks like if you were to print it:
#DF
# letters value test
#1 A 0.440 2
#2 B 0.540 c(1,4)
#3 C 0.210 1
#4 D 0.102 3:4
#5 E 0.002 c(1,2)
My main issue is DF$test. For any cell that holds more than one value (e.g. 3:4 or c(1,2)), I would like the cell to be formatted as X:Y, where X and Y are numeric values.
Can someone help? Please note that DF$test is a character vector.

Another gsub option that uses 2 gsubs:
DF$test2 <- gsub(",",":", gsub(".*c\\((.*)\\).*", "\\1", DF$test))
DF
# letters value test test2
#1 A 0.440 2 2
#2 B 0.540 c(1,4) 1:4
#3 C 0.210 1 1
#4 D 0.102 3:4 3:4
#5 E 0.002 c(1,2) 1:2
The first gsub extracts everything between the c( and ), and the second gsub replaces every , with :. This also works when c() holds more than two numbers: c(1,2,3) would become 1:2:3.
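A minimal sketch of the two steps on a standalone vector, so the intermediate result is visible. The vector x and the names inner and out are made up for illustration; only the two patterns come from the answer above.

```r
# Hypothetical input mirroring DF$test, plus one three-element case
x <- c("2", "c(1,4)", "1", "3:4", "c(1,2,3)")

# Step 1: keep only what sits between "c(" and ")"; other strings pass through
inner <- gsub(".*c\\((.*)\\).*", "\\1", x)

# Step 2: turn the remaining commas into colons
out <- gsub(",", ":", inner)
print(out)
```

Strings without c( ) do not match the first pattern, so they reach the second step unchanged.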

With:
tst <- gsub('[c()]','',DF$test)
tst <- strsplit(tst, '[,:]')
DF$test <- sapply(tst, paste0, collapse = ':')
or in one go:
DF$test <- sapply(strsplit(gsub('[c()]','',DF$test), '[,:]'), paste0, collapse = ':')
your data.frame now looks like:
> DF
letters value test
1 A 0.440 2
2 B 0.540 1:4
3 C 0.210 1
4 D 0.102 3:4
5 E 0.002 1:2
The advantage of this is that it also works with strings in DF$test that are longer than 2 numbers.

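gsub should get you there, as long as every multi-value cell has the form c(X,Y):
# Anchored pattern, so multi-digit numbers such as c(10,4) are captured whole
DF$test <- gsub("^c\\((\\d+),\\s*(\\d+)\\)$", "\\1:\\2", DF$test)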

We can use str_extract_all from stringr
library(stringr)
DF$test <- sapply(str_extract_all(DF$test, '[0-9]+'), paste, collapse=":")
DF$test
#[1] "2" "1:4" "1" "3:4" "1:2"
Or using base R
DF$test <- sapply(regmatches(DF$test, gregexpr('[0-9]+', DF$test)), paste, collapse=":")
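If this is needed in more than one place, the base R version could be wrapped in a small helper. This is only a sketch: to_colon is a made-up name, and it assumes the cells contain nothing but non-negative whole numbers (a value like 1.5 would be split into 1 and 5).

```r
# to_colon() is a hypothetical helper name, not part of base R or stringr
to_colon <- function(x) {
  # gregexpr() locates every run of digits; regmatches() extracts them per element
  digits <- regmatches(x, gregexpr("[0-9]+", x))
  # Join the digit runs of each element with ":" (always one string per element)
  vapply(digits, paste, character(1), collapse = ":")
}

res <- to_colon(c("2", "c(1, 4)", "3:4", "c(10,2,3)"))
print(res)
```

Because only the digit runs are kept, spaces after the commas and multi-digit numbers are handled the same way as the simple cases.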

Related

How to map values in a dataframe to a key with values specific to each column?

This is a little complicated to explain in words, so I'll just give you a dummy version of my data.
#dummy data for help
col.1<-c(1,2,2,1,1)
col.2<-c(2,1,3,1,2)
col.3<-c(2,4,1,1,2)
df<-data.frame(col.1,col.2,col.3)
key<-c("A","B","C","D","E","F","G","H","I")
names(key)<-c("col.1_1","col.1_2",
"col.2_1","col.2_2","col.2_3",
"col.3_1","col.3_2","col.3_3","col.3_4")
> key
col.1_1 col.1_2 col.2_1 col.2_2 col.2_3 col.3_1 col.3_2 col.3_3 col.3_4
"A" "B" "C" "D" "E" "F" "G" "H" "I"
> df
col.1 col.2 col.3
1 1 2 2
2 2 1 4
3 2 3 1
4 1 1 1
5 1 2 2
If the name of an item in key is col.x_y, this should be interpreted to mean that it should correspond to all values y in column df$col.x. Hence there are as many elements in key as there are unique value-column pairs in df.
My end goal is to replace all ys in df$col.x with the value of col.x_y in key.
For example, since col.2_3 is E, ALL 3s in df$col.2 -- and ONLY in df$col.2 -- should be replaced with E.
So I want to end up with:
> df
col.1 col.2 col.3
1 "A" "D" "G"
2 "B" "C" "I"
3 "B" "E" "F"
4 "A" "C" "F"
5 "A" "D" "G"
Importantly, though, my real key is a named numeric list, and the values are essentially random; I just used ordered letters for the simplicity of the example.
The problem I keep running into is that I can't figure out how to refer to a variable column name. df$variable_name doesn't seem to work in any of the approaches I've tried, and I can't think of a way around it.
Here are a couple of ideas I've come up with:
recodeY<- function(.x,.y) {
split_name<-strsplit(.y, split="_")
score<-split_name[2]
column_name<-split_name[1]
gsub(column_name=score, .x, df)
}
library(purrr)
map2(key, names(key), recodeY)
Not sure if this was the version that ran successfully, but when it did, it came up with 0 values in the output.
This one is just a mess:
for(i in 1:ncol(df)) {
for(j in 1:length(key)) {
col_name<-colnames(df[i])
split_name<-unlist(strsplit(names(j), split="_"))
item_name<-split_name[1]
if(col_name==item_name){
score<-split_name[2]
str_replace(i, ?, ?) #lost track of what I was doing, never done for loops before
#gsub(i, j, finalCYOA2$i) couldn't get gsub or sub to work, ran into the same problem
}
}
}
Any advice on how to proceed? I'd say I'm of intermediate skill with R, but I have a lot of gaps in my knowledge since I self-taught.
EDIT: Since one of the potential solutions was breaking on my real data, here's a more naturalistic example:
col.1A<-c(NA,2,2,1,1)
col.2A<-c(2,1,3,1,NA)
col.2A.1<-c(2,4,NA,1,3)
df<-data.frame(col.1A,col.2A,col.2A.1)
key<-c(1.111,1.222,
2.111,2.222,2.333,
3.111,3.222,3.333,3.444)
names(key)<-c("col.1A_1","col.1A_2",
"col.2A_1","col.2A_2","col.2A_3",
"col.2A.1_1","col.2A.1_2","col.2A.1_3","col.2A.1_4")
key
col.1A_1 col.1A_2 col.2A_1 col.2A_2 col.2A_3 col.2A.1_1 col.2A.1_2 col.2A.1_3 col.2A.1_4
1.111 1.222 2.111 2.222 2.333 3.111 3.222 3.333 3.444
df
col.1A col.2A col.2A.1
1 NA 2 2
2 2 1 4
3 2 3 NA
4 1 1 1
5 1 NA 3

Here is an option in tidyverse. Reshape the 'key' to a long format data with pivot_longer. Then loop across the columns of 'df', extract the values from the corresponding column names and replace by matching the values with the 'grp' column
library(dplyr)
library(tidyr)
new <- key %>%
as.data.frame.list %>%
pivot_longer(cols = everything(), names_to = c(".value", 'grp'),
names_sep="_")
df %>%
mutate(across(everything(), ~ new[[cur_column()]][match(., new$grp)]))
Output:
# col.1 col.2 col.3
#1 A D G
#2 B C I
#3 B E F
#4 A C F
#5 A D G
or another option is to loop across the columns of data, create the names of the 'key' by pasting the column name with the value and extract the 'key' based on the name
library(stringr)
df %>%
mutate(across(everything(), ~ key[str_c(cur_column(), "_", .)]))
Or use base R
new <- transform(stack(key), grp = as.integer(sub(".*_", "", ind)),
ind = sub("_.*", "", ind))
df[] <- Map(function(x, y) y$values[match(x, y$grp)], df, split(new[-2], new$ind))
Update
Using the new example, it works well
df %>%
mutate(across(everything(), ~ key[str_c(cur_column(), "_", .)]))
# col.1A col.2A col.2A.1
#1 NA 2.222 3.222
#2 1.222 2.111 3.444
#3 1.222 2.333 NA
#4 1.111 2.111 3.111
#5 1.111 NA 3.333
Or the one with base R
new <- transform(stack(key), grp = as.integer(sub(".*_", "", ind)),
ind = sub("_.*", "", ind))
df[] <- Map(function(x, y) y$values[match(x, y$grp)], df, split(new[-2], new$ind))
df
# col.1A col.2A col.2A.1
#1 NA 2.222 3.222
#2 1.222 2.111 3.444
#3 1.222 2.333 NA
#4 1.111 2.111 3.111
#5 1.111 NA 3.333
Or with pivot_longer
new <- key %>%
as.data.frame.list %>%
pivot_longer(cols = everything(), names_to = c(".value", 'grp'),
names_sep="_")
df %>%
mutate(across(everything(), ~ new[[cur_column()]][match(., new$grp)]))
# col.1A col.2A col.2A.1
#1 NA 2.222 3.222
#2 1.222 2.111 3.444
#3 1.222 2.333 NA
#4 1.111 2.111 3.111
#5 1.111 NA 3.333

paste column names and values together, and match:
df[] <- key[match( paste(names(df)[col(df)], unlist(df), sep="_"), names(key) )]
df
# col.1 col.2 col.3
#1 A D G
#2 B C I
#3 B E F
#4 A C F
#5 A D G
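For the part of the question about referring to a column whose name is held in a variable: df$variable_name looks for a column literally called variable_name, whereas df[[variable_name]] uses the value stored in the variable. A minimal loop version of the same paste-and-look-up idea, as a sketch (recoded, nm and lookup are illustrative names):

```r
# Data and key copied from the question
df <- data.frame(col.1 = c(1, 2, 2, 1, 1),
                 col.2 = c(2, 1, 3, 1, 2),
                 col.3 = c(2, 4, 1, 1, 2))
key <- setNames(LETTERS[1:9],
                c("col.1_1", "col.1_2",
                  "col.2_1", "col.2_2", "col.2_3",
                  "col.3_1", "col.3_2", "col.3_3", "col.3_4"))

recoded <- df
for (nm in names(df)) {
  # df[[nm]] selects the column whose name is stored in nm
  lookup <- paste(nm, df[[nm]], sep = "_")
  # Index the named vector by the pasted names; unmatched names give NA
  recoded[[nm]] <- unname(key[lookup])
}
print(recoded)
```

An NA in df is pasted as "col.x_NA", which is not a name in key, so it comes back as NA.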

Turn a col into rownames in a list of dataframes

I have a list of dataframes, and I would like to turn one of the columns in each into the rownames
(instead of doing this for every df individually).
Unfortunately I can't get it to work; maybe someone can help?
DF1 <- data.frame(A = c("A", "B", "C"),
B = 1:3)
DF2 <- data.frame(A = c("A", "B", "C"),
B = 1:3)
TheList <- list(DF1 = DF1,
DF2 = DF2)
col_to_rownames_andDel <- function(df){
rownames(df) <- df$A
df$A <- NULL
}
TheList_namedRows <- map(TheList, col_to_rownames_andDel) #not working and empties the dfs
Thanks!
Sebastian

Return the changed dataframe in the last line of the function.
col_to_rownames_andDel <- function(df){
rownames(df) <- df$A
df$A <- NULL
return(df)
}
TheList_namedRows <- purrr::map(TheList, col_to_rownames_andDel)
#Using lapply
#TheList_namedRows <- lapply(TheList, col_to_rownames_andDel)
TheList_namedRows
#$DF1
# B
#A 1
#B 2
#C 3
#$DF2
# B
#A 1
#B 2
#C 3

Ah well... it seems I simply forgot the return statement in the function.
It might be useful for others, so I'll keep it here.
col_to_rownames_andDel <- function(df){
rownames(df) <- df$A
df$A <- NULL
return(df)
}
TheList_namedRows <- map(TheList, col_to_rownames_andDel)
Rubber duck debugging at its best.

You can also do:
library(purrr)    # map()
library(tibble)   # column_to_rownames()
library(magrittr) # %>%
map(.x = TheList,
~ .x %>%
column_to_rownames("A"))
$DF1
B
A 1
B 2
C 3
$DF2
B
A 1
B 2
C 3

We can use column_to_rownames from tibble
library(tibble)
library(purrr)
map(TheList, column_to_rownames, "A")
#$DF1
# B
#A 1
#B 2
#C 3
#$DF2
# B
#A 1
#B 2
#C 3

How to split a dataframe and attach the split part in a new column?

I want to split a dataframe wherever the value in the first column changes, and afterward attach each split part as new columns. An example is given below. However, I end up with a list that I can't turn back into a handy dataframe.
The desired output should look like df_goal, which is not yet properly formatted.
#data
x <-c(1,2,3)
y <-c(20200101,20200101,20200101)
z <-c(4.5,5,7)
x_name <- "ID"
y_name <- "Date"
z_name <- "value"
df <-data.frame(x,y,z)
names(df) <- c(x_name,y_name,z_name)
#processing
df$date <-format(as.Date(as.character(df$date), format="%Y%m%d"))
df01 <- split(df, f = df$ID)
#goal
a <-c(1)
b <-c(20200101)
c <-c(4.5)
d <-c(2)
e <-c(20200101)
f <-c(5)
g <-c(3)
h <-c(20200101)
i <-c(7)
df_goal <- data.frame(a,b,c,d,e,f,g,h,i)

You can use Reduce and cbind to cbind each row of a data.frame in one row and keep the type of the columns.
Reduce(function(x,y) cbind(x, df[y,]), 2:nrow(df), df[1,])
# ID Date value ID Date value ID Date value
#1 1 20200101 4.5 2 20200101 5 3 20200101 7
#Equivalent for the sample dataset: cbind(cbind(df[1,], df[2,]), df[3,])
or do.call with split:
do.call(cbind, split(df, 1:nrow(df)))
# 1.ID 1.Date 1.value 2.ID 2.Date 2.value 3.ID 3.Date 3.value
#1 1 20200101 4.5 2 20200101 5 3 20200101 7
#Equivalent for the sample dataset: cbind(df[1,], df[2,], df[3,])
In case you have several rows per ID you can try:
x <- split(df, df$ID)
y <- max(unlist(lapply(x, nrow)))
do.call(cbind, lapply(x, function(i) i[1:y,]))
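To see what the last variant does when the groups have different sizes, here is a sketch on a made-up data frame df_multi with two rows for ID 1 and one row for ID 2. Indexing a one-row group with 1:y pads it with an NA row, so all pieces have the same height before cbind.

```r
# Hypothetical data: ID 1 appears twice, ID 2 once
df_multi <- data.frame(ID    = c(1, 1, 2),
                       Date  = c(20200101, 20200102, 20200101),
                       value = c(4.5, 5, 7))

x <- split(df_multi, df_multi$ID)
y <- max(vapply(x, nrow, integer(1)))  # size of the largest group

# i[1:y, ] pads shorter groups with NA rows
wide <- do.call(cbind, lapply(x, function(i) i[1:y, ]))
print(wide)
```

The column names are prefixed with the group label, as in the do.call(cbind, split(...)) output above.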

This is a possible solution for your example :
new_df = data.frame(list(df[1,],df[2,],df[3,]))
And if you want to generalize that on a bigger data.frame :
new_list = list()
for ( i in 1:dim(df)[1] ){
new_list[[i]] = df[i,]
}
new_df = data.frame(new_list)

One option could be:
setNames(Reduce(c, asplit(df, 1)), letters[1:Reduce(`*`, dim(df))])
a b c d e f g h i
1.0 20200101.0 4.5 2.0 20200101.0 5.0 3.0 20200101.0 7.0

Maybe you can try the following code
df_goal <- data.frame(t(c(t(df))))
such that
> df_goal
X1 X2 X3 X4 X5 X6 X7 X8 X9
1 1 20200101 4.5 2 20200101 5 3 20200101 7
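A sketch of what happens step by step, with one caveat: t() converts the data frame to a matrix, so every column is coerced to a single type. That is harmless for the all-numeric example, but with a character column everything becomes character. The names m, flat, df_chr and wide_chr are illustrative.

```r
df <- data.frame(ID    = c(1, 2, 3),
                 Date  = c(20200101, 20200101, 20200101),
                 value = c(4.5, 5, 7))

m <- t(df)      # 3 x 3 numeric matrix, one column per original row
flat <- c(m)    # column-major flattening: row 1 of df, then row 2, then row 3
df_goal <- data.frame(t(flat))  # 1 x 9 data frame with columns X1..X9
print(df_goal)

# Caveat: a mixed-type data frame is coerced to character by t()
df_chr <- data.frame(ID = c("a", "b"), value = c(1.5, 2),
                     stringsAsFactors = FALSE)
wide_chr <- data.frame(t(c(t(df_chr))), stringsAsFactors = FALSE)
print(sapply(wide_chr, class))
```

If the column types must be preserved, the Reduce/cbind answer above is the safer choice.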

Error when unlisting columns in a data frame

Suppose I have a data frame called DF:
options(stringsAsFactors = F)
letters <- list("A", "B", "C", "D")
numbers <- list(list(1,2), 1, 1, 2)
score <- list(.44, .54, .21, .102)
DF <- data.frame(cbind(letters, numbers, score))
Note that all columns in the data frame are of class "list".
Also, take a look at the structure: DF$numbers[1] is also a list
I'm trying to UNLIST each column.
DF$letters <- unlist(DF$letters)
DF$score <- unlist(DF$score)
DF$numbers <- unlist(DF$numbers)
However, because DF$numbers[1] is itself a list, I get this error:
Error in `$<-.data.frame`(`*tmp*`, numbers, value = c(1, 2, 1, 1, 2)) :
replacement has 5 rows, data has 4
Is there a way to unlist the whole column while keeping multi-value cells such as DF$numbers[1] as a single string like c(1,2) or 1,2?
Ideally DF would look something like this, with the individual values in the numbers column still of type int:
letters numbers score
A 1,2 .44
B 1 .54
C 1 .21
D 2 .102
The goal is to then write the data frame to a csv file.

You can apply unlist to each individual element of the column numbers instead of the whole column:
DF$numbers <- lapply(DF$numbers, unlist)
DF
# letters numbers score
#1 A 1, 2 0.440
#2 B 1 0.540
#3 C 1 0.210
#4 D 2 0.102
DF$numbers[1]
#[[1]]
#[1] 1 2
Or paste the elements as a single string if you want an atomic vector column:
DF$numbers <- sapply(DF$numbers, toString)
DF
# letters numbers score
#1 A 1, 2 0.44
#2 B 1 0.54
#3 C 1 0.21
#4 D 2 0.102
DF$numbers[1]
#[1] "1, 2"
class(DF$numbers)
# [1] "character"
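Since the stated goal is a CSV file, here is a sketch of the toString route end to end. It builds the list column directly instead of via cbind, uses lttrs rather than letters to avoid masking the built-in, and writes to a temporary file; all of those are choices made for this example only.

```r
# Rebuild a small version of DF with a nested list column
DF <- data.frame(lttrs = c("A", "B", "C", "D"),
                 score = c(.44, .54, .21, .102),
                 stringsAsFactors = FALSE)
DF$numbers <- list(list(1, 2), 1, 1, 2)

# Collapse every cell to one string so the column becomes atomic
DF$numbers <- sapply(DF$numbers, toString)

# write.csv() quotes strings, so the comma inside "1, 2" is safe
tmp <- tempfile(fileext = ".csv")
write.csv(DF, tmp, row.names = FALSE)
back <- read.csv(tmp, stringsAsFactors = FALSE)
print(back)
```

Note that the numbers column is character after this step; the individual values would have to be parsed again if they are needed as numbers later.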

You can do:
DF$letters <- unlist(DF$letters)
DF$score <- unlist(DF$score)
DF$numbers <- unlist(as.character(DF$numbers))
This returns:
DF
letters numbers score
1 A c(1, 2) 0.440
2 B 1 0.540
3 C 1 0.210
4 D 2 0.102

How to create logical variable based on logical condition?

I have a data frame with factor variables
> a <- c("a", "b", "c")
> b <- c("c", "b", "a")
> df <- as.data.frame(cbind(a,b))
> df$a <- as.factor(df$a)
> df$b <- as.factor(df$b)
> df
a b
1 a c
2 b b
3 c a
I want to create a new logical variable that is TRUE where var a and var b are equal.
> df$result <- isTRUE(df$a == df$b)
But I get the result:
> df
a b result
1 a c FALSE
2 b b FALSE
3 c a FALSE
When I expected
> df
a b result
1 a c FALSE
2 b b TRUE
3 c a FALSE
(I'm using factors to replicate my real data)
What am I doing wrong? How can I achieve my goal of identifying similar variables? Thanks

Just do
df$result <- with(df, a==b)
df
# a b result
#1 a c FALSE
#2 b b TRUE
#3 c a FALSE
The a==b already returns a logical vector and we don't need isTRUE to wrap it.
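To make the difference concrete: isTRUE() only returns TRUE for a single TRUE value, so for a vector of length 3 it gives one FALSE, which is then recycled down the whole column. A short sketch using the same values as the question:

```r
a <- factor(c("a", "b", "c"))
b <- factor(c("c", "b", "a"))

elementwise <- a == b     # one logical value per row
single <- isTRUE(a == b)  # a single FALSE, because the input has length 3

print(elementwise)
print(single)
```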
As @Frank mentioned in the comments, it is better to compare character columns, because comparing factors with different levels results in an error. We can either convert the factors to character before comparing
with(df, as.character(a)==as.character(b))
or give both columns the same set of levels
Un1 <- union(levels(df$a), levels(df$b))
df[] <- lapply(df, factor, levels=Un1)
with(df, a==b)
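A sketch of the failure case and both workarounds, on two made-up factors f1 and f2 whose level sets differ. The tryCatch is only there so the example keeps running after the error.

```r
f1 <- factor(c("a", "b"))
f2 <- factor(c("a", "c"))

# Comparing factors with different level sets raises an error
msg <- tryCatch({ f1 == f2; "no error" },
                error = function(e) conditionMessage(e))
print(msg)

# Workaround 1: compare as character
same <- as.character(f1) == as.character(f2)

# Workaround 2: give both factors the union of the levels first
lev <- union(levels(f1), levels(f2))
aligned <- factor(f1, levels = lev) == factor(f2, levels = lev)

print(same)
print(aligned)
```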