R concatenating a table into a string

I have a dataframe
A | B
A_Long_String | 7.1123
Another_String | 1234124
and I need to turn it into one long string to feed into a SQL proc.
So the format needs to be
' "A_Long_String": "7.1123" \n "Another_String": "1234124" \n '
and so on.
What is the best way of doing this? Do I just have to loop?
The dataframe can have upwards of 100,000 rows.
Thanks

Try
df1[] <- lapply(df1, dQuote)
paste(do.call(paste, c(df1, sep=": ")), '\n', collapse=' ')
#[1] "“A_Long_String”: “7.1123” \n “Another_String”: “1234124” \n"
Data
df1 <- structure(list(A = c("A_Long_String", "Another_String"),
                      B = c(7.1123, 1234124)),
                 .Names = c("A", "B"), class = "data.frame",
                 row.names = c(NA, -2L))
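
Because dQuote() may emit typographic quotes (as in the output above) depending on the session's useFancyQuotes option, a plain-ASCII variant could be sketched with sprintf(). This is only a sketch under that assumption, not the answer's original code. It rebuilds the same example data and stays fully vectorized, so no loop is needed. The names pairs and sql_string are introduced here for illustration.

```r
# Same example data as in the answer above
df1 <- data.frame(A = c("A_Long_String", "Another_String"),
                  B = c(7.1123, 1234124),
                  stringsAsFactors = FALSE)

# Build one '"key": "value"' piece per row (vectorized over all rows)
pairs <- sprintf('"%s": "%s"', df1$A, df1$B)

# Append " \n" to every piece, then join the pieces with single spaces
sql_string <- paste(paste(pairs, "\n"), collapse = " ")

# cat() shows the string with real line breaks instead of the escaped "\n"
cat(sql_string)
```

Note that sprintf("%s", ...) converts numbers as as.character() would, so some values (very large or round ones, for example) may come out in scientific notation; formatting column B explicitly beforehand may be safer in that case.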

Related

Extract numbers from a character vector and adding leading zeros

I have a character-vector with the following structure:
GDM3
PER.1.1.1_1
PER.1.10.2_1
PER.1.1.32_1
PER.1.1.4_1
PER.1.1.5_1
PER.11.29.1_1
PER.1.2.2_1
PER.31.2.3_1
PER.1.2.44_1
PER.5.2.25_1
I want to extract the three numbers in the middle of each ID and add a leading zero to any that are single digits. The final vector can be a character vector again. In the end the result should look like this:
GDM3
010101
011002
010132
010104
010105
112901
010202
310203
010244
050225
tmp <- strcapture("\\.([0-9]+)\\.([0-9]+)\\.([0-9]+)_", X$GDM3,
proto = list(a=0L, b=0L, c=0L)) |>
lapply(sprintf, fmt = "%02i")
do.call(paste0, tmp)
# [1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203" "010244" "050225"
Explanation:
strcapture extracts the known patterns into a data.frame, with names and classes defined in proto (the actual values in proto are not used);
lapply(sprintf, fmt="%02i") zero-pads all columns of the frame to 2 digits;
do.call(paste0, tmp) concatenates each row of the frame into a single string.
Data
X <- structure(list(GDM3 = c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1", "PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1", "PER.1.2.44_1", "PER.5.2.25_1")), class = "data.frame", row.names = c(NA, -10L))
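
The three steps in the explanation above can be made visible one at a time. The following is a minimal sketch on a hypothetical two-element input (ids, parts, padded and result are names introduced here), written without the |> pipe so it also runs on R versions older than 4.1.

```r
# Hypothetical two-element input, just to expose the intermediate steps
ids <- c("PER.1.10.2_1", "PER.31.2.3_1")

# Step 1: capture the three number groups as integer columns a, b, c
parts <- strcapture("\\.([0-9]+)\\.([0-9]+)\\.([0-9]+)_", ids,
                    proto = list(a = 0L, b = 0L, c = 0L))

# Step 2: zero-pad every column to two digits (result is a list of character vectors)
padded <- lapply(parts, sprintf, fmt = "%02i")

# Step 3: glue the padded columns together, row by row
result <- do.call(paste0, padded)
result
```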
Assuming GDM3 as shown in the Note at the end, read it into a data frame and then use sprintf to create the result.
with( read.table(text = GDM3, sep = ".", comment.char = "_"),
sprintf("%02d%02d%02d", V2, V3, V4) )
giving:
[1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203"
[9] "010244" "050225"
Note
GDM3 <- c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1",
"PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1",
"PER.1.2.44_1", "PER.5.2.25_1")
Another solution:
X <- structure(list(GDM3 = c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1", "PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1", "PER.1.2.44_1", "PER.5.2.25_1")), class = "data.frame", row.names = c(NA, -10L))
strsplit(X$GDM3, "\\.|_") |>
sapply(function(x) paste0(sprintf("%02i", as.numeric(x[2:4])), collapse = ""))
#[1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203" "010244" "050225"

extract until next "_" if contains

Is there a way to extract part of a string when it contains a match, taking everything from the match up to the next underscore ("_")?
From: mycampaign_s22uhd4k_otherinfo I need: s22uhd4k.
From: my_campaign_otherinfo_s22jumpto_otherinfo , I would need: s22jumpto
data:
df <- structure(list(a = c("mycampaign_s22uhd4k_otherinfo", "my_campaign_otherinfo_s22jumpto_otherinfo"
), b = c(1, 2)), class = "data.frame", row.names = c(NA, -2L))
Thanks Omar, based on your update/comment, this regex will solve your problem:
df <- structure(list(a = c("mycampaign_s22uhd4k_otherinfo",
"my_campaign_otherinfo_s22jumpto_otherinfo",
"e220041_pe_mx_aon_aonjulio_conversion_shop_facebook-network_ppl_primaria_s22test512gb_hotsale_20220620"
), b = c(1, 2, 3)), class = "data.frame", row.names = c(NA, -3L))
gsub(df$a, pattern = ".*(s22[^_]+(?=_)).*", replacement = "\\1", perl = TRUE)
#> [1] "s22uhd4k" "s22jumpto" "s22test512gb"
Created on 2022-07-17 by the reprex package (v2.0.1)
Explanation:
.*(s22[^_]+(?=_)).*
.*                    match all characters up until the first capture group
  (s22                the first capture group starts with "s22"
      [^_]+           after "s22", match one or more characters other than "_"
           (?=_)      until the next "_" (positive lookahead)
                )     close the first capture group
                 .*   match all remaining characters
Then, replacement = "\\1" replaces the whole string with just the captured text (the part you want).
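
One caveat with the gsub() approach is that a string with no match is returned unchanged. A sketch of an alternative, assuming you would rather get NA in that case, uses regexpr() with regmatches(). The third input element and the names a, m, hit and out are hypothetical, added here only for illustration.

```r
a <- c("mycampaign_s22uhd4k_otherinfo",
       "my_campaign_otherinfo_s22jumpto_otherinfo",
       "no_match_here")   # hypothetical row without an "s22" token

# regexpr() returns the start of the first match per string, or -1 if none
m <- regexpr("s22[^_]+(?=_)", a, perl = TRUE)
hit <- as.integer(m) > 0

# regmatches() returns only the matched substrings, so fill them in by index
out <- rep(NA_character_, length(a))
out[hit] <- regmatches(a, m)
out
```

Unlike the greedy `.*` prefix in the gsub() pattern, regexpr() reports the first "s22" token in each string, which only matters if a string contains more than one.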

How to remove additional numbers in each cell in a dataframe

I am doing some data analysis in R. I read in a CSV file and would like to eliminate the 000 groups (e.g. 000,000,000) from each cell. How can I get rid of only the 000 entries? I tried grep(), but it dropped rows.
This is the dataframe:
You can try this. I have included dummy data based on your screenshot (and please pay attention to the comment of #andrew_reece):
#Code
df$NewVar <- trimws(gsub('000','',df$VIOLATIONS_RAW),whitespace = ',')
Output:
   VIOLATIONS_RAW      NewVar
1 202,403,506,000 202,403,506
2     213,145,123 213,145,123
3         212,000         212
4 123,000,000,000         123
Some data used:
#Data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")
We could also do this in a more general way, removing any comma-separated field that consists only of 0's, however many:
df$VIOLATIONS_RAW <- trimws(gsub("(?<=,)0+(?=(,|$))", "",
df$VIOLATIONS_RAW, perl = TRUE), whitespace=",")
df$VIOLATIONS_RAW
#[1] "202,403,506" "213,145,123" "212" "123"
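
Both regex approaches above rely on the zero groups sitting at the end of the string. If an all-zero group could also appear in the middle (an assumption, since the sample data has none), a split-and-filter sketch avoids leaving doubled commas behind. The helper name drop_zero_fields and the fifth test value are hypothetical.

```r
# Hypothetical helper: drop every comma-separated field made up only of zeros
drop_zero_fields <- function(x) {
  vapply(strsplit(x, ",", fixed = TRUE), function(parts) {
    paste(parts[!grepl("^0+$", parts)], collapse = ",")
  }, character(1))
}

violations <- c("202,403,506,000", "213,145,123", "212,000",
                "123,000,000,000",
                "202,000,403")   # hypothetical value with a zero group in the middle
cleaned <- drop_zero_fields(violations)
cleaned
```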
data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")

concatenate strings of one column separated by comma in R

I would like to wrap each string in a column in double quotes and concatenate them, separated by commas, because I want to use the result in an SQL query.
A<-c('John', 'Kate', 'Kaitlyn', 'Arun',' Chen')
df<- data.frame(A)
So I would like to concatenate all the rows of column A into a single string and write it out as a text file. Below is the expected text. Any suggestions?
"John", "Kate", "Kaitlyn", "Arun", "Chen"
Try this approach:
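# Data
A <- c('John', 'Kate', 'Kaitlyn', 'Arun', ' Chen')
df <- data.frame(A, stringsAsFactors = FALSE)
# Trim stray whitespace and wrap each name in double quotes
df$A <- paste0('"', trimws(df$A), '"')
# Collapse all rows into a single comma-separated string
df2 <- data.frame(val = paste0(df$A, collapse = ', '), stringsAsFactors = FALSE)
# Export without row names, column names, or extra quoting
write.table(df2, file = 'File.txt', row.names = FALSE, col.names = FALSE, quote = FALSE)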
Output (contents of File.txt):
"John", "Kate", "Kaitlyn", "Arun", "Chen"
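
Since the result is a single string, the intermediate data frame is not strictly needed. A shorter sketch, assuming a plain one-line text file is all that is required, uses sprintf() and writeLines(). The names quoted, sql_list and out_file are introduced here, and the temporary file stands in for whatever path you actually want.

```r
A <- c("John", "Kate", "Kaitlyn", "Arun", " Chen")

# Trim stray whitespace, then wrap each name in double quotes
quoted <- sprintf('"%s"', trimws(A))

# Join everything into one comma-separated string
sql_list <- paste(quoted, collapse = ", ")

# Write the single line to a file (a temporary file here, as a placeholder path)
out_file <- tempfile(fileext = ".txt")
writeLines(sql_list, out_file)
```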

Parse text with separator depending on its structure

My dataframe:
>datasetM
Mean
ENSORLG00000001933:tex11 2500.706
ENSORLG00000010797: 44225.330
ENSORLG00000003008:pabpc1a 11788.555
ENSORLG00000001973:sept6 3100.493
ENSORLG00000000997: 5418.796
Output needed:
>out
[1] "tex11" "ENSORLG00000010797" "pabpc1a" "sept6" "ENSORLG00000000997"
I tried this, but I only retrieve the part before the separator:
titles <- rownames(datasetM)
vapply(strsplit(titles,":"), `[`, 1, FUN.VALUE=character(1))
Note: There is no pattern to which rows have the form ENS000:name and which have ENS00:
Note 2: The ENSOR IDs are row names
Note 3: When there is nothing after ":" I want the ENSOR ID
Here is a solution with base R:
sapply(strsplit(rownames(df), ":"), function(x) x[length(x)])
# [1] "tex11" "ENSORLG00000010797" "pabpc1a" "sept6"
# [5] "ENSORLG00000000997"
Another solution with sub, might be simpler:
sub("^\\w+:(?=\\w)|:", "", rownames(df), perl = TRUE)
# [1] "tex11" "ENSORLG00000010797" "pabpc1a" "sept6"
# [5] "ENSORLG00000000997"
Data:
df = read.table(text = " Mean
ENSORLG00000001933:tex11 2500.706
ENSORLG00000010797: 44225.330
ENSORLG00000003008:pabpc1a 11788.555
ENSORLG00000001973:sept6 3100.493
ENSORLG00000000997: 5418.796", header = TRUE, row.names = 1)
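
To see why the sub() pattern above works, its two alternatives can be tried separately. The sketch below uses two hypothetical row names (ids): the first branch removes the ID and colon only when a word character follows, and the second branch removes a bare colon otherwise.

```r
ids <- c("ENSORLG00000001933:tex11", "ENSORLG00000010797:")

# First branch alone: strips "ID:" only where a name follows the colon
first_only <- sub("^\\w+:(?=\\w)", "", ids, perl = TRUE)

# Both branches: where the first one cannot match, the lone ":" is removed instead
both <- sub("^\\w+:(?=\\w)|:", "", ids, perl = TRUE)

first_only
both
```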
Here is a vectorized way to do this, using a regex to identify the last character of each row name:
no_trailing_colon <- sub('.*(?=.$)', '', rownames(df), perl = TRUE) != ':'
rownames(df)[no_trailing_colon] <- sub('.*:', '', rownames(df)[no_trailing_colon])
which gives,
                           V2
tex11                2500.706
ENSORLG00000010797: 44225.330
pabpc1a             11788.555
sept6                3100.493
ENSORLG00000000997:  5418.796
DATA
dput(df)
structure(list(V2 = c(2500.706, 44225.33, 11788.555, 3100.493,
5418.796)), .Names = "V2", row.names = c("tex11", "ENSORLG00000010797:",
"pabpc1a", "sept6", "ENSORLG00000000997:"), class = "data.frame")
NOTE: You can remove the remaining colons from the row names simply with
rownames(df) <- sub(':', '', rownames(df))
