Splitting a column based on select character? - r

I have a dataframe with many columns. For one of the columns ('cols'), it roughly has this structure:
'x\y\z'
Some of the rows are 'x\y\z' and others are 'x\y'. I am only interested in the 'y' portion of the row.
I have been looking through various posts on stackoverflow by people with similar questions, but I have not been able to find a solution that works. The closest that I got was this (which resulted in an error):
x = strsplit(df['cols'], "\")
I have a feeling I may not be utilizing a package correctly. Any help would be great!
Edit: Included sample structure and expected output
Current structure:
cols
'test\foo\bar'
'test\foo'
'test\bar'
'test\foo\foo'
Expected output:
cols
'foo'
'foo'
'bar'
'foo'

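We need to escape the backslash. In an R string literal a single backslash is written "\\", and because strsplit treats the split pattern as a regular expression, that backslash has to be escaped again, giving "\\\\".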
df$cols <- sapply(strsplit(df$cols, "\\\\"), `[`, 2)
df$cols
#[1] "foo" "foo" "bar" "foo"
Or with sub
sub("^\\w+.(\\w+).*", "\\1", df$cols)
#[1] "foo" "foo" "bar" "foo"
data
df <- structure(list(cols = c("test\\foo\\bar", "test\\foo", "test\\bar",
"test\\foo\\foo")), .Names = "cols", class = "data.frame", row.names = c(NA,
-4L))
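If the double escaping feels error-prone, one alternative sketch is to pass fixed = TRUE, so that strsplit treats the separator as a literal string rather than a regular expression; then only the string-literal escape ("\\") is needed. The vector cols below is a stand-in for df$cols from the data above, and second is just an illustrative name.

```r
# Stand-in for df$cols: each "\\" in the source is ONE backslash in the string
cols <- c("test\\foo\\bar", "test\\foo", "test\\bar", "test\\foo\\foo")

nchar("\\")  # 1: the literal is a single backslash character

# fixed = TRUE: the separator is matched literally, so no regex escaping
parts <- strsplit(cols, "\\", fixed = TRUE)

# Take the second piece of every row; vapply guarantees a character vector
second <- vapply(parts, `[`, character(1), 2)
second
# [1] "foo" "foo" "bar" "foo"
```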

You can have a look at a great package for data manipulation: tidyr
Then:
df = tidyr::separate(df, col = cols, into = c("x", "y", "z"), sep="\\\\")
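(note the escaped backslash). The part you want is then in df$y. Rows with only two pieces get NA in z, with a warning unless you also pass fill = "right".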

Related

Extract numbers from a character vector and adding leading zeros

I have a character-vector with the following structure:
GDM3
PER.1.1.1_1
PER.1.10.2_1
PER.1.1.32_1
PER.1.1.4_1
PER.1.1.5_1
PER.11.29.1_1
PER.1.2.2_1
PER.31.2.3_1
PER.1.2.44_1
PER.5.2.25_1
I want to extract the three numbers in the middle of that ID and add leading zeros if they are only single digits. The final vector can be a character vector again. In the end the result should look like this:
GDM3
010101
011002
010132
010104
010105
112901
010202
310203
010244
050225
tmp <- strcapture("\\.([0-9]+)\\.([0-9]+)\\.([0-9]+)_", X$GDM3,
proto = list(a=0L, b=0L, c=0L)) |>
lapply(sprintf, fmt = "%02i")
do.call(paste0, tmp)
# [1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203" "010244" "050225"
Explanation:
strcapture extracts the known patterns into a data.frame, with names and classes defined in proto (the actual values in proto are not used);
lapply(sprintf, fmt="%02i") zero-pads to 2 digits all columns of the frame
do.call(paste0, tmp) concatenates each row of the frame into a single string.
Data
X <- structure(list(GDM3 = c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1", "PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1", "PER.1.2.44_1", "PER.5.2.25_1")), class = "data.frame", row.names = c(NA, -10L))
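To make the three steps in the explanation easier to follow, here is a small sketch that runs them one at a time on two of the IDs and keeps the intermediate results. The names ids, caps, padded and result are only illustrative.

```r
ids <- c("PER.1.10.2_1", "PER.11.29.1_1")

# Step 1: capture the three numbers into integer columns a, b, c
caps <- strcapture("\\.([0-9]+)\\.([0-9]+)\\.([0-9]+)_", ids,
                   proto = list(a = 0L, b = 0L, c = 0L))
caps$b
# [1] 10 29

# Step 2: zero-pad every column to two digits (gives a list of character vectors)
padded <- lapply(caps, sprintf, fmt = "%02i")
padded$a
# [1] "01" "11"

# Step 3: paste the columns together element by element
result <- do.call(paste0, padded)
result
# [1] "011002" "112901"
```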
Assuming GDM3 as shown in the Note at the end, read it into a data frame and then use sprintf to create the result.
with( read.table(text = GDM3, sep = ".", comment.char = "_"),
sprintf("%02d%02d%02d", V2, V3, V4) )
giving:
[1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203"
[9] "010244" "050225"
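To see why the read.table approach works, the following sketch inspects the intermediate data frame for two IDs: sep = "." splits each ID into fields, and comment.char = "_" discards everything from the underscore onwards. The names ids, fields and res are illustrative only.

```r
ids <- c("PER.1.10.2_1", "PER.11.29.1_1")

# "_" starts a comment, so "_1" is dropped; "." separates the four fields
fields <- read.table(text = ids, sep = ".", comment.char = "_")
names(fields)
# [1] "V1" "V2" "V3" "V4"

# V2..V4 are read as integers, so %02d can zero-pad them
res <- with(fields, sprintf("%02d%02d%02d", V2, V3, V4))
res
# [1] "011002" "112901"
```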
Note
GDM3 <- c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1",
"PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1",
"PER.1.2.44_1", "PER.5.2.25_1")
Another solution:
X <- structure(list(GDM3 = c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1", "PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1", "PER.1.2.44_1", "PER.5.2.25_1")), class = "data.frame", row.names = c(NA, -10L))
strsplit(X$GDM3, "\\.|_") |>
sapply(function(x) paste0(sprintf("%02i", as.numeric(x[2:4])), collapse = ""))
#[1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203" "010244" "050225"

How to convert a dataframe in long format into a list of an appropriate format?

I have a dataframe in the following long format:
I need to convert it into a list which should look something like this:
Each main element of the list would be an "Instance No.", and its sub-elements should contain all of its corresponding Parameter and Value pairs, in the format "Parameter X" = "abc" as you can see in the second picture, listed one after the other.
Is there any existing function which can do this? I wasn't really able to find any. Any help would be really appreciated.
Thank you.
A dplyr solution
require(dplyr)
df_original <- data.frame("Instance No." = c(3,3,3,3,5,5,5,2,2,2,2),
"Parameter" = c("age", "workclass", "education", "occupation",
"age", "workclass", "education",
"age", "workclass", "education", "income"),
"Value" = c("Senior", "Private", "HS-grad", "Sales",
"Middle-aged", "Gov", "Hs-grad",
"Middle-aged", "Private", "Masters", "Large"),
check.names = FALSE)
# split() groups rows by a grouping variable, which it coerces to a factor.
# Param_Value will be the properly formatted vector
df_modified <- mutate(df_original,
Param_Value = paste0(Parameter, "=", Value))
# drop the parameter and value columns now that the data is contained in Param_Value
df_modified <- select(df_modified,
`Instance No.`,
Param_Value)
# there is now a list containing dataframes with rows grouped by Instance No.
list_format <- split(df_modified,
df_modified$`Instance No.`)
# The Instance No. is still in each dataframe. Loop through each and strip the column.
list_simplified <- lapply(list_format,
select, -`Instance No.`)
# unlist the remaining Param_Value column and drop the names.
list_out <- lapply(list_simplified,
unlist, use.names = FALSE)
There should now be a list of vectors formatted as requested.
$`2`
[1] "age=Middle-aged" "workclass=Private" "education=Masters" "income=Large"
$`3`
[1] "age=Senior" "workclass=Private" "education=HS-grad" "occupation=Sales"
$`5`
[1] "age=Middle-aged" "workclass=Gov" "education=Hs-grad"
The posted data.table solution is faster, but I think this is a bit more understandable.
require(data.table)
# your_df is already long (one Parameter/Value pair per row), so no melt is needed
dt_long <- as.data.table(your_df)
class(dt_long) # for debugging
dt_long[, strVal := paste(Parameter, Value, sep = '=')]
result_list <- list()
for (i in unique(dt_long[['Instance No.']])){
result_list[[as.character(i)]] <- dt_long[`Instance No.`==i, strVal]
}
Just for reference, here is the base R one-liner to do this. df is your dataframe.
l <- lapply(split(df, df[["Instance No."]]),
function(x) paste0(x$Parameter, "=", x$Value))
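As a self-contained sketch of the same idea, the pairs can also be pasted first and the resulting character vector split by instance number. The small data frame below is made-up sample data in the shape described in the question (columns "Instance No.", Parameter, Value); pairs and l are illustrative names.

```r
# Made-up long-format sample: one Parameter/Value pair per row
df <- data.frame("Instance No." = c(3, 3, 5, 2),
                 Parameter = c("age", "workclass", "age", "age"),
                 Value = c("Senior", "Private", "Middle-aged", "Middle-aged"),
                 check.names = FALSE)

# Build "Parameter=Value" strings, then group them by instance number
pairs <- paste0(df$Parameter, "=", df$Value)
l <- split(pairs, df[["Instance No."]])

names(l)
# [1] "2" "3" "5"
l[["3"]]
# [1] "age=Senior"        "workclass=Private"
```

List elements come back ordered by instance number, not by first appearance, because split sorts the factor levels.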

Write csv file with non-numeric columns quoted and no row names

I'm trying to write a csv file from a data frame, i.e:
Col_A Col_B Col_C
Hello World 4
Once More 21
Hi Data 23
So far I use this code:
ds = dataf
write.csv(ds,"test.csv", row.names = FALSE, quote = c(1,2), sep = ",")
However, the result is:
Col_A,"Col_B","Col_C"
Hello,"World",4
Once,"More",21
Hi,"Data",23
But I really need to have something like this:
"Col_A","Col_B","Col_C"
"Hello","World",4
"Once","More",21
"Hi","Data",23
Note that everything is between double quotes except the numeric values, and fields are separated by commas. I can do that if I also write the rownames, but I really don't want them.
There is no point in setting sep to "," because write.csv always uses it; attempts to change sep are ignored with a warning.
Anyway, are you sure of your data.frame design?
This seems to work:
df <- rbind(c("Hello", "Once", "Hi"), c("World", "More", "Data"), c(4,21,23))
df <- as.data.frame(t(df))
write.csv(df,"test.csv", row.names = FALSE, quote = c(1,2))
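For comparison, here is a minimal sketch that relies on write.csv's default quote = TRUE, which quotes the header and every character (or factor) column and leaves numeric columns bare. It assumes the first two columns really are character and the third numeric; the data frame ds is rebuilt from the question's sample, and out_file is a temporary file used only for illustration.

```r
ds <- data.frame(Col_A = c("Hello", "Once", "Hi"),
                 Col_B = c("World", "More", "Data"),
                 Col_C = c(4, 21, 23))

# Default quote = TRUE: character columns and the header are quoted
out_file <- tempfile(fileext = ".csv")
write.csv(ds, out_file, row.names = FALSE)

csv_lines <- readLines(out_file)
writeLines(csv_lines)
# "Col_A","Col_B","Col_C"
# "Hello","World",4
# "Once","More",21
# "Hi","Data",23
```

If Col_A came out unquoted in the original attempt, it is worth checking with str(ds) how each column is actually stored.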

Paste value after certain delimiter

I have data in the following format:
In Column A:
String1__String2__String3
In Column B:
Value
I would like to paste the Value into the String after the first delimiter like this:
String1__Value__String2__String3
The crucial part of the code I am using now (where I paste the value) is the following line:
df2 <-cbind(df[1],apply(df[,2:ncol(df)],2,function(i)ifelse(is.na(i), NA, paste(df[,1],i,sep="_"))))
With this code it append the value after the string, like this:
String1__String2__String3__Value
Is there an easy way to rearrange this so the Values will be pasted at the correct place. Or do I have to redo the complete code ?
Thanks
Update, Example:
Column A:
Jennifer__DoesSomething__inaCity
Column B:
2
Result now:
Jennifer__DoesSomething__inaCity__2
Desired result:
Jennifer__2__DoesSomething__inaCity
The strings Jennifer, DoesSomething, inaCity change and are not the same length. Only the delimiter stays the same. I want to paste after the first delimiter.
Thanks !
Here is an idea. sub replaces only the first match of the pattern, so replacing the first '__' with '__<value>__' inserts the value right after the first delimiter. mapply applies this row by row, pairing each string in the first column with its value in the second column.
mapply(function(x, y) sub('__', paste0('__', y, '__'), x), df$v1, df$v2)
# atsfs__dsfgg__sdgsdg eeee__FFFF__GGGG
#"atsfs__3__dsfgg__sdgsdg" "eeee__5__FFFF__GGGG"
DATA
dput(df)
structure(list(v1 = c("atsfs__dsfgg__sdgsdg", "eeee__FFFF__GGGG"
), v2 = c(3, 5)), .Names = c("v1", "v2"), row.names = c(NA, -2L
), class = "data.frame")
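The difference between sub and gsub is what makes this work, so here is a short sketch on the example from the question. The vectors strings and values are illustrative stand-ins for the two columns; fixed = TRUE and USE.NAMES = FALSE are optional additions that treat the delimiter literally and drop the names from the result.

```r
strings <- c("Jennifer__DoesSomething__inaCity", "eeee__FFFF__GGGG")
values <- c(2, 5)

# sub touches only the first delimiter, gsub touches all of them
first_only <- sub("__", "__2__", strings[1])
first_only
# [1] "Jennifer__2__DoesSomething__inaCity"
every <- gsub("__", "__2__", strings[1])
every
# [1] "Jennifer__2__DoesSomething__2__inaCity"

# Row-wise insertion: pair each string with its own value
inserted <- mapply(function(s, v) sub("__", paste0("__", v, "__"), s, fixed = TRUE),
                   strings, values, USE.NAMES = FALSE)
inserted
# [1] "Jennifer__2__DoesSomething__inaCity" "eeee__5__FFFF__GGGG"
```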

R convert dataframe to JSON

I have a dataframe that I'd like to convert to json format:
my data frame called res1:
library(rjson)
res1 <- structure(list(id = c(1, 2, 3, 4, 5), value = structure(1:5, .Label = c("server1",
"server2", "server3", "server4", "server5"), class = "factor")), .Names = c("id",
"value"), row.names = c(NA, -5L), class = "data.frame")
when I do:
toJSON(res1)
I get this:
{"id":[1,2,3,4,5],"value":["server1","server2","server3","server4","server5"]}
I need this json output to be like this, any ideas?
[{"id":1,"value":"server1"},{"id":2,"value":"server2"},{"id":3,"value":"server3"},{"id":4,"value":"server4"},{"id":5,"value":"server5"}]
The jsonlite package exists to address exactly this problem: "A practical and consistent mapping between JSON data and R objects."
Its toJSON function provides this desired result with the default options:
library(jsonlite)
x <- toJSON(res1)
cat(x)
## [{"id":1,"value":"server1"},{"id":2,"value":"server2"},
## {"id":3,"value":"server3"},{"id":4,"value":"server4"},
## {"id":5,"value":"server5"}]
How about
library(rjson)
x <- toJSON(unname(split(res1, 1:nrow(res1))))
cat(x)
# [{"id":1,"value":"server1"},{"id":2,"value":"server2"},
# {"id":3,"value":"server3"},{"id":4,"value":"server4"},
# {"id":5,"value":"server5"}]
By using split() we are essentially breaking up the large data.frame into a separate data.frame for each row. And by removing the names from the resulting list, the toJSON function wraps the results in an array rather than a named object.
Now you can easily just call jsonlite::write_json() directly on the dataframe.
You can also use library(jsonify)
jsonify::to_json( res1 )
# [{"id":1.0,"value":"server1"},{"id":2.0,"value":"server2"},{"id":3.0,"value":"server3"},{"id":4.0,"value":"server4"},{"id":5.0,"value":"server5"}]
