How to remove additional numbers in each cell in a dataframe - r

I am doing some data analysis in R. I read a CSV file and would like to eliminate the 000 entries (for example 000,000,000) from each cell. How can I get rid of only the 000 values? I tried to use grep(), but it dropped rows.
This is the dataframe:

You can try this. I have included dummy data based on your screenshot (and please pay attention to the comment from andrew_reece):
#Code
df$NewVar <- trimws(gsub('000','',df$VIOLATIONS_RAW),whitespace = ',')
Output:
   VIOLATIONS_RAW      NewVar
1 202,403,506,000 202,403,506
2     213,145,123 213,145,123
3         212,000         212
4 123,000,000,000         123
Some data used:
#Data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")

We could also do this in a more general way, removing any entry that follows a comma and consists entirely of 0s, whatever its length:
df$VIOLATIONS_RAW <- trimws(gsub("(?<=,)0+(?=(,|$))", "",
df$VIOLATIONS_RAW, perl = TRUE), whitespace=",")
df$VIOLATIONS_RAW
#[1] "202,403,506" "213,145,123" "212" "123"
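Both substitution approaches above operate on the raw string, so they can leave a doubled comma when a zero code sits in the middle (for example "202,000,506"), and the plain gsub('000', ...) version would also alter a code such as "1000". A minimal sketch of an alternative, assuming the codes are always comma-separated: split each cell, drop the elements that consist only of zeros, and paste the rest back together. The helper name drop_zero_codes and the fifth test row are illustrative and not part of the question.

```r
# Hypothetical helper: remove all-zero codes from comma-separated strings
drop_zero_codes <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  vapply(parts, function(p) {
    keep <- p[!grepl("^0+$", p)]   # drop elements made up only of zeros
    paste(keep, collapse = ",")
  }, character(1))
}

df <- data.frame(
  VIOLATIONS_RAW = c("202,403,506,000", "213,145,123", "212,000",
                     "123,000,000,000", "000,202,000,1000"),
  stringsAsFactors = FALSE
)
df$NewVar <- drop_zero_codes(df$VIOLATIONS_RAW)
df$NewVar
# [1] "202,403,506" "213,145,123" "212"         "123"         "202,1000"
```

Because the check is anchored with ^ and $, a code like "1000" is left untouched, and zeros at the start, middle, or end of a cell are all handled the same way.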
data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")

Related

Extract numbers from a character vector and adding leading zeros

I have a character-vector with the following structure:
GDM3
PER.1.1.1_1
PER.1.10.2_1
PER.1.1.32_1
PER.1.1.4_1
PER.1.1.5_1
PER.11.29.1_1
PER.1.2.2_1
PER.31.2.3_1
PER.1.2.44_1
PER.5.2.25_1
I want to extract the three numbers in the middle of that ID and add leading zeros if they are only single digits. The final vector can be a character vector again. In the end the result should look like this:
GDM3
010101
011002
010132
010104
010105
112901
010202
310203
010244
050225
tmp <- strcapture("\\.([0-9]+)\\.([0-9]+)\\.([0-9]+)_", X$GDM3,
proto = list(a=0L, b=0L, c=0L)) |>
lapply(sprintf, fmt = "%02i")
do.call(paste0, tmp)
# [1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203" "010244" "050225"
Explanation:
strcapture extracts the known patterns into a data.frame, with names and classes defined in proto (the actual values in proto are not used);
lapply(sprintf, fmt = "%02i") zero-pads every column of the frame to 2 digits;
do.call(paste0, tmp) concatenates each row of the frame into a single string.
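To make the three steps above easier to follow, here is a small sketch that runs them one at a time on two of the IDs and keeps the intermediate objects. The variable names ids, parts, padded, and result are illustrative.

```r
ids <- c("PER.1.1.1_1", "PER.11.29.1_1")

# Step 1: capture the three number groups into integer columns a, b, c
parts <- strcapture("\\.([0-9]+)\\.([0-9]+)\\.([0-9]+)_", ids,
                    proto = list(a = 0L, b = 0L, c = 0L))
parts$a
# [1]  1 11

# Step 2: zero-pad every column to width 2 (result is a list of character vectors)
padded <- lapply(parts, sprintf, fmt = "%02i")
padded$a
# [1] "01" "11"

# Step 3: paste the columns together element by element
result <- do.call(paste0, padded)
result
# [1] "010101" "112901"
```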
Data
X <- structure(list(GDM3 = c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1", "PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1", "PER.1.2.44_1", "PER.5.2.25_1")), class = "data.frame", row.names = c(NA, -10L))
Assuming GDM3 is as shown in the Note at the end, read it into a data frame and then use sprintf to create the result.
with( read.table(text = GDM3, sep = ".", comment.char = "_"),
sprintf("%02d%02d%02d", V2, V3, V4) )
giving:
[1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203"
[9] "010244" "050225"
Note
GDM3 <- c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1",
"PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1",
"PER.1.2.44_1", "PER.5.2.25_1")
Another solution:
X <- structure(list(GDM3 = c("PER.1.1.1_1", "PER.1.10.2_1", "PER.1.1.32_1", "PER.1.1.4_1", "PER.1.1.5_1", "PER.11.29.1_1", "PER.1.2.2_1", "PER.31.2.3_1", "PER.1.2.44_1", "PER.5.2.25_1")), class = "data.frame", row.names = c(NA, -10L))
strsplit(X$GDM3, "\\.|_") |>
sapply(function(x) paste0(sprintf("%02i", as.numeric(x[2:4])), collapse = ""))
#[1] "010101" "011002" "010132" "010104" "010105" "112901" "010202" "310203" "010244" "050225"

How do I find common characters in a list of dataframes?

I have about 70 dataframes in a list, each of them has a column named SNP. I want to find the common SNPs that exist in all dataframes. This is the code I used:
setwd("~")
library(data.table)
library(purrr) # provides map(), reduce() and %>%
files <- list.files()
dflist <- list()
for(i in 1:length(files)){
dflist[[i]] <- fread(files[i])
}
map(dflist, ~.$SNP) %>%
reduce(intersect)
However, this returns an empty result:
character(0)
Here is a sample of dflist:
list(structure(list(`10:103391446` = c("10:115562764:TTTC_",
"10:115562765:TTC_T", "10:14188623_CCTGA_C", "10:15988900:G_GGT"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"
)), structure(list(SNP = c("rs34394051",
"rs11121177", "rs10799615", "rs590013")), row.names = c(NA, -4L
), class = c("data.table", "data.frame")),
structure(list(SNP = c("rs34394051", "rs11121177", "rs10799615",
"rs590013")), row.names = c(NA, -4L), class = c("data.table",
"data.frame")))
Can you help please?
Your problems appear to be two-fold:
One of your frames is missing SNP as a column name. That will often cause problems:
setdiff(mtcars$QUUX, mtcars$cyl)
# NULL
This is not hard to fix (names(dflist[[1]]) <- "SNP"), but does not resolve all of the problems.
Your first frame has completely different-looking data. When I skip the first frame, it works.
map(dflist[-1], ~.$SNP) %>%
reduce(intersect)
# [1] "rs34394051" "rs11121177" "rs10799615" "rs590013"
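With about 70 frames it can help to find the offending ones programmatically before intersecting. The following is a minimal base R sketch, assuming plain data frames and a small made-up dflist (the values are shortened from the sample in the question); has_snp and common are illustrative names.

```r
# Small stand-in for the real list; the first frame lacks an SNP column
dflist <- list(
  data.frame(`10:103391446` = c("10:115562764:TTTC_", "10:15988900:G_GGT"),
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(SNP = c("rs34394051", "rs11121177", "rs590013"),
             stringsAsFactors = FALSE),
  data.frame(SNP = c("rs34394051", "rs590013", "rs10799615"),
             stringsAsFactors = FALSE)
)

# Which frames actually have a column named SNP?
has_snp <- vapply(dflist, function(d) "SNP" %in% names(d), logical(1))
which(!has_snp)
# [1] 1

# Intersect only the frames that have the column
common <- Reduce(intersect, lapply(dflist[has_snp], `[[`, "SNP"))
common
# [1] "rs34394051" "rs590013"
```

Frames flagged by which(!has_snp) are worth inspecting by hand, since a missing header usually points to a problem with how that file was read.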

How to build a new list based on another file and sort it in a certain way

I have a list that contains multiple files, and it looks like this:
Now I have a df that looks like this:
structure(list(Order = c(1, 2, 3, 4), Data = c("Bone Scan", "Brain Scan",
"", "Cancer History")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
How can I build a new data list which contains only the data that is named in df$Data, stored in the order in which it appears in df?
Try to subset datalist using df$Data. It should give data in the same order as df$Data.
result <- datalist[df$Data]
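With purrr, note that pluck() extracts a single element and expects each index to have length 1, so passing the whole df$Data vector to it does not work. Mapping over the names is one alternative:
library(purrr)
map(set_names(df$Data), ~ datalist[[.x]])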
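Since df$Data contains an empty string, subsetting with it directly also yields a placeholder entry for the name that has no match. A minimal sketch that drops such entries first, assuming datalist is a named list; the element names and contents below are made up because the original list was only shown as an image.

```r
# Hypothetical named list standing in for the real datalist
datalist <- list(
  `Cancer History` = data.frame(id = 1:2),
  `Bone Scan`      = data.frame(id = 3:4),
  `Brain Scan`     = data.frame(id = 5:6),
  `Other`          = data.frame(id = 7)
)
df <- data.frame(Order = c(1, 2, 3, 4),
                 Data = c("Bone Scan", "Brain Scan", "", "Cancer History"),
                 stringsAsFactors = FALSE)

# Follow the Order column, then keep only names that exist in datalist
wanted <- df$Data[order(df$Order)]
wanted <- wanted[wanted %in% names(datalist)]

result <- datalist[wanted]
names(result)
# [1] "Bone Scan"      "Brain Scan"     "Cancer History"
```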

Find string and create additional column

I have a list of data that contains a bunch of strings that contain currency codes. The location of the code varies within the string, and I am looking for a way to separate the code out.
I've tried searching, but all the suggestions I can find centre around the string being in the same location or separated by a similar character (eg. _ or -)
My input looks something like this:
input = structure(list(V1 = c("asdf23.USD123", "DKK1234", "1dCNY_d",
"fgdUSD33", "912#NZD")), class = "data.frame", row.names = c(NA,
-5L))
and I have a list of currencies I'm looking for like this:
fx = c("CNY", "DKK", "NZD", "USD")
I am trying to search the V1 column for values that match the list, and create a new column with the corresponding currency, eg:
output = structure(list(V1 = c("asdf23.USD123", "DKK1234", "1dCNY_d",
"fgdUSD33", "912#NZD"), V2 = c("USD", "DKK", "CNY", "USD", "NZD"
)), class = "data.frame", row.names = c(NA, -5L))
I don't know where I'd begin to look. Can anyone suggest what I should be searching for?
One option is to extract the substring by pasting the elements of 'fx' into a single alternation pattern (CNY|DKK|NZD|USD):
library(dplyr)
library(stringr)
input %>%
mutate(V2 = str_extract(V1, str_c(fx, collapse="|")))
# V1 V2
#1 asdf23.USD123 USD
#2 DKK1234 DKK
#3 1dCNY_d CNY
#4 fgdUSD33 USD
#5 912#NZD NZD
Or in base R
input$V2 <- regmatches(input$V1, regexpr(paste(fx, collapse="|"), input$V1))
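One caveat with the base R line: regmatches() returns only the matches that were found, so the assignment fails with a length mismatch if any row contains none of the codes. A minimal sketch that fills those rows with NA instead, assuming that is the desired behaviour; the sixth row is added here purely to show the no-match case.

```r
input <- data.frame(
  V1 = c("asdf23.USD123", "DKK1234", "1dCNY_d", "fgdUSD33", "912#NZD",
         "no code here"),
  stringsAsFactors = FALSE
)
fx <- c("CNY", "DKK", "NZD", "USD")

# regexpr() returns -1 for elements without a match
m <- regexpr(paste(fx, collapse = "|"), input$V1)

# Start with NA everywhere, then fill in only the rows that matched
input$V2 <- NA_character_
input$V2[m > 0] <- regmatches(input$V1, m)
input$V2
# [1] "USD" "DKK" "CNY" "USD" "NZD" NA
```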

R convert dataframe to JSON

I have a dataframe that I'd like to convert to json format:
My data frame is called res1:
library(rjson)
structure(list(id = c(1, 2, 3, 4, 5), value = structure(1:5, .Label = c("server1",
"server2", "server3", "server4", "server5"), class = "factor")), .Names = c("id",
"value"), row.names = c(NA, -5L), class = "data.frame")
when I do:
toJSON(res1)
I get this:
{"id":[1,2,3,4,5],"value":["server1","server2","server3","server4","server5"]}
I need this json output to be like this, any ideas?
[{"id":1,"value":"server1"},{"id":2,"value":"server2"},{"id":3,"value":"server3"},{"id":4,"value":"server4"},{"id":5,"value":"server5"}]
The jsonlite package exists to address exactly this problem: "A practical and consistent mapping between JSON data and R objects."
Its toJSON function provides this desired result with the default options:
library(jsonlite)
x <- toJSON(res1)
cat(x)
## [{"id":1,"value":"server1"},{"id":2,"value":"server2"},
## {"id":3,"value":"server3"},{"id":4,"value":"server4"},
## {"id":5,"value":"server5"}]
How about
library(rjson)
x <- toJSON(unname(split(res1, 1:nrow(res1))))
cat(x)
# [{"id":1,"value":"server1"},{"id":2,"value":"server2"},
# {"id":3,"value":"server3"},{"id":4,"value":"server4"},
# {"id":5,"value":"server5"}]
By using split() we are essentially breaking up the large data.frame into a separate data.frame for each row. And by removing the names from the resulting list, the toJSON function wraps the results in an array rather than a named object.
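The reshaping step described above can be inspected without any JSON package. This sketch uses only base R and a three-row stand-in for res1; it uses seq_len(nrow(res1)) rather than 1:nrow(res1), which is a small substitution that also behaves sensibly for a frame with zero rows.

```r
res1 <- data.frame(id = c(1, 2, 3),
                   value = c("server1", "server2", "server3"),
                   stringsAsFactors = FALSE)

# One single-row data frame per original row, named "1", "2", "3"
rows <- split(res1, seq_len(nrow(res1)))
names(rows)
# [1] "1" "2" "3"

# Dropping the names leaves a plain, unnamed list, which a JSON
# encoder can treat as an array rather than as a keyed object
rows_unnamed <- unname(rows)
is.null(names(rows_unnamed))
# [1] TRUE
rows_unnamed[[2]]$value
# [1] "server2"
```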
With current versions of jsonlite you can also call jsonlite::write_json() directly on the data frame to write the JSON to a file.
You can also use library(jsonify)
jsonify::to_json( res1 )
# [{"id":1.0,"value":"server1"},{"id":2.0,"value":"server2"},{"id":3.0,"value":"server3"},{"id":4.0,"value":"server4"},{"id":5.0,"value":"server5"}]
