Separate Million and Billion Data from one column

I am trying the code below to separate "M" and "B", along with their values, into two different columns.
I want output like this:
level 1   level 2
M 3.2     B 3.6
M 4       B 2.8
          B 3.5
Input:
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
#class(reve)
data=data.frame(reve)
Here is what I have tried.
index=which(grepl("M ",data$reve))
data$reve=gsub("M ","",data$reve)
data$reve=gsub("B ","",data$reve)
data$reve=as.numeric(data$reve)

If you have a data frame, you can do that with separate() from tidyr.
Here is an example:
library(dplyr)
library(tidyr) # separate() lives in tidyr, not dplyr
df <- tibble(coupe = c("M 2.3", "M 4.5", "B 1"))
df %>% separate(coupe, c("MorB","Quant"), " ")
OUTPUT
# MorB Quant
# <chr> <chr>
#1 M 2.3
#2 M 4.5
#3 B 1
Hope it helps!
For counting the number of "M" rows:
df %>% separate(YourColumn, c("MorB","Quant"), " ") %>%
filter(MorB == "M") %>% nrow()

Here is a base R approach.
lst <- split(reve, substr(reve, 1, 1))
df1 <- as.data.frame(lapply(lst, `length<-`, max(lengths(lst))))
df1
# B M
#1 B 3.6 M 3.2
#2 B 2.8 M 4
#3 B 3.5 <NA>
Split the vector in two by the first letter. This gives you a list with entries of unequal length. Use lapply to give all entries the same length, i.e. pad the shorter one with NAs. Then call as.data.frame.
If you want to change the names, you can use setNames:
setNames(df1, c("level_2", "level_1"))
In case I misunderstood your desired output, try
df1 <- data.frame(do.call(rbind, (strsplit(reve, " "))), stringsAsFactors = FALSE)
df1[] <- lapply(df1, type.convert, as.is = TRUE)
df1
# X1 X2
#1 M 3.2
#2 B 3.6
#3 B 2.8
#4 B 3.5
#5 M 4.0
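If numeric values are wanted in the wide layout rather than strings, the two ideas above can be combined. The following is a minimal base R sketch, assuming every entry has the form of one letter, one space, then a number; the names unit, value, by_unit and wide are illustrative only.

```r
reve <- c("M 3.2", "B 3.6", "B 2.8", "B 3.5", "M 4")

unit  <- substr(reve, 1, 1)                    # "M" or "B"
value <- as.numeric(sub("^[A-Z] ", "", reve))  # strip the letter and the space

by_unit <- split(value, unit)                  # one numeric vector per letter
n <- max(lengths(by_unit))                     # length of the longest group
wide <- as.data.frame(lapply(by_unit, `length<-`, n))  # pad shorter groups with NA
wide
```

The columns come out in alphabetical order (B, then M), so rename or reorder them afterwards if "level 1" should be the M column.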

I think options rooted in regex may also be helpful for these types of problems
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
data=data.frame(reve, stringsAsFactors = FALSE) # handle your data as strings, not factors
# regex to extract M vals and B vals (pass the column, not the whole data frame)
mvals <- unlist(stringi::stri_extract_all_regex(data$reve, "M+\\s[0-9]\\.[0-9]|M+\\s[0-9]", omit_no_match = TRUE))
bvals <- unlist(stringi::stri_extract_all_regex(data$reve, "B+\\s[0-9]\\.[0-9]|B+\\s[0-9]", omit_no_match = TRUE))
# gluing things together into a single df
len <- max(length(mvals), length(bvals)) # find the length
data.frame(M = c(mvals, rep(NA, len - length(mvals))) # ensure vectors are the same size
,B = c(bvals, rep(NA, len - length(bvals)))) # ensure vectors are the same size
In case regex is unfamiliar, the first expression searches for one or more "M" followed by a whitespace character, then a single digit 0 through 9, then a period, then another single digit. The vertical pipe is an "or" operator, so the expression also searches for "M" followed by whitespace and a single digit. The second half of the expression accounts for cases like "M 4". The second expression does the same thing, just for entries that contain "B" in lieu of "M".
These are quick and dirty regex statements. I'm sure cleaner formulations are possible to get the same results.
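One possible tighter formulation, shown here only as a sketch in base R, anchors the pattern to the whole string and allows any number of digits on either side of an optional decimal point. The helper pattern_for is a made-up name for this illustration, and the sketch assumes a single literal space between the letter and the number.

```r
reve <- c("M 3.2", "B 3.6", "B 2.8", "B 3.5", "M 4")

# letter, one space, one or more digits, optional fractional part
pattern_for <- function(letter) paste0("^", letter, " [0-9]+(\\.[0-9]+)?$")

mvals <- reve[grepl(pattern_for("M"), reve)]  # entries that are "M <number>"
bvals <- reve[grepl(pattern_for("B"), reve)]  # entries that are "B <number>"

mvals
bvals
```

Because the pattern is anchored with ^ and $, a malformed entry such as "MB 3" or "M 3.2x" is simply left out instead of being partially matched.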

We can count Millions or Billions as follows:
Input dataset:
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
data=data.frame(reve)
Code
library(dplyr)
library(tidyr)
data %>%
separate(reve, c("Label", "Value"),extra = "merge") %>%
group_by(Label) %>%
summarise(n = n())
Output
# A tibble: 2 x 2
Label n
<chr> <int>
1 B 3
2 M 2
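For a quick count without any packages, a base R sketch along the same lines could tabulate the first character of each entry. It assumes the label is always exactly one leading letter.

```r
reve <- c("M 3.2", "B 3.6", "B 2.8", "B 3.5", "M 4")

# take the leading letter of every entry and count how often each occurs
counts <- table(substr(reve, 1, 1))
counts
```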

Related

How to separate a column into two columns

df <- data.frame(PATIENT_ID=c(1,2,3,4),
CODE=c('N18','N180','N190','M1920'))
I want to separate the variable 'CODE' into two variables. One variable holds the first letter of 'CODE' ('N' or 'M' in my case); the other holds the remaining digits. If there are more than two digits, insert a '.' after the second digit.
The output should be
df <- data.frame(PATIENT_ID=c(1,2,3,4),
CODE=c('N18','N180','N190','M1920'),
VOR_1=c('N','N','N','M'),
VOR_2=c('18','18.0','19.0','19.20'))
Finally, define the variable of 'VOR_2' as a numeric variable.

Using sub for a base R solution:
df$VOR_1 <- sub("^([A-Z]).*$", "\\1", df$CODE)
df$VOR_2 <- sub("^([0-9]{2})(?=[0-9])", "\\1.", sub("^[A-Z]([0-9]+)$", "\\1", df$CODE), perl=TRUE)
df$VOR_2 <- as.numeric(df$VOR_2) # if desired; the output below shows VOR_2 before this conversion
df
  PATIENT_ID  CODE VOR_1 VOR_2
1          1   N18     N    18
2          2  N180     N  18.0
3          3  N190     N  19.0
4          4 M1920     M 19.20
An explanation of the logic behind VOR_2 is warranted. We first extract all the digits from the second character onwards using the simple regex ^[A-Z]([0-9]+)$. Then, we make a second call to sub on the digit string, to insert a decimal point after the second digit. The pattern uses a positive lookahead, which ensures that a dot gets inserted only when there are three or more digits.
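To make the two steps easier to follow, here is the same logic unrolled into intermediate values. This is only a sketch; the variable names codes, digits and with_dot are illustrative and not part of the answer above.

```r
codes <- c("N18", "N180", "N190", "M1920")

# step 1: drop the leading letter, keep the digit string
digits <- sub("^[A-Z]([0-9]+)$", "\\1", codes)

# step 2: add a dot after the first two digits, but only if another digit follows
with_dot <- sub("^([0-9]{2})(?=[0-9])", "\\1.", digits, perl = TRUE)

digits
with_dot
as.numeric(with_dot)
```

The lookahead (?=[0-9]) checks for a following digit without consuming it, which is why "18" is left untouched while "180" becomes "18.0". perl = TRUE is needed because lookaheads are not available in R's default regex engine.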

An idea via tidyr using separate can be,
library(dplyr)
library(tidyr) #separate
df %>%
separate(CODE, into = c("text", "num"), sep = "(?<=[A-Za-z])(?=[0-9])") %>%
mutate(num = as.numeric(num),
num = num / (10 ^ (nchar(num) - 2))
)
# PATIENT_ID text num
#1 1 N 18.0
#2 2 N 18.0
#3 3 N 19.0
#4 4 M 19.2
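The arithmetic in the mutate step can be checked in isolation. The sketch below assumes the digit strings have already been split off from the letter and that each has at least two digits; dividing by 10 raised to (number of digits minus 2) shifts the decimal point so that exactly two digits remain in front of it.

```r
digits <- c("18", "180", "190", "1920")

# 2 digits -> divide by 1, 3 digits -> divide by 10, 4 digits -> divide by 100
shift <- nchar(digits) - 2
num <- as.numeric(digits) / 10^shift
num
```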

You can use str_extract and sub:
library(stringr)
df$VOR1 <- str_extract(df$CODE, "^[A-Z]")
Here, you simply grab the capital letter at the beginning of the string, which is anchored by ^.
df$VOR2 <- sub("(\\d{2})(\\d{1,2})", "\\1.\\2", str_extract(df$CODE, "\\d+"))
Here, you first extract just the digits using str_extract and then insert the period . where appropriate.
Result:
df
PATIENT_ID CODE VOR1 VOR2
1 1 N18 N 18
2 2 N180 N 18.0
3 3 N190 N 19.0
4 4 M1920 M 19.20

Is there any way to delete the rows of data which don't have all numeric values?

I have data that has two columns. Each column of data has numerical values in it but some of them don't have any numerical values. I want to remove the rows which don't have all values numerical. In reality, the data has 1000 rows but for simplification, I made the data file in smaller size here. Thanks!
a <- c(1, 2, 3, 4, "--")
b <- c("--", 2, 3, "--", 5)
data <- data.frame(a, b)

One base R option could be:
data[!is.na(Reduce(`+`, lapply(data, as.numeric))), ]
a b
2 2 2
3 3 3
And for importing the data, use stringsAsFactors = FALSE.
Or using sapply():
data[!is.na(rowSums(sapply(data, as.numeric))), ]

An easier option is to convert to numeric with as.numeric and then check for NA. Any element that is not numeric becomes NA (with a warning), which is.na detects; use that inside filter_all to remove the rows:
library(dplyr)
data %>%
filter_all(all_vars(!is.na(as.numeric(.))))
# a b
#1 2 2
#2 3 3
If we don't like the warnings, an option is to detect numeric-only elements with a regex via str_detect. The pattern matches one or more digits or dots ([0-9.]+) from the start (^) to the end ($) of the string:
library(stringr)
data %>%
filter_all(all_vars(str_detect(., "^[0-9.]+$")))
# a b
#1 2 2
#2 3 3
If -- is the only non-numeric value, the rows are even easier to remove:
data[!rowSums(data == "--"),]
# a b
#2 2 2
#3 3 3
data
data <- data.frame(a,b, stringsAsFactors = FALSE)
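A base R variant that also leaves the remaining columns numeric could look like the sketch below. It assumes the columns are character (not factor) and wraps the conversion in suppressWarnings, since the coercion warnings are expected here; numeric_data and clean are illustrative names.

```r
a <- c(1, 2, 3, 4, "--")
b <- c("--", 2, 3, "--", 5)
data <- data.frame(a, b, stringsAsFactors = FALSE)

# convert every column; anything that is not a number becomes NA
numeric_data <- as.data.frame(
  lapply(data, function(col) suppressWarnings(as.numeric(col)))
)

# keep only the rows in which no column is NA
clean <- numeric_data[complete.cases(numeric_data), ]
clean
```

Note that a genuine missing value in the input would be dropped as well, because after conversion it is indistinguishable from a failed conversion.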

For loop doesn't properly filter

I want to print two dataframes, where the first one is all the rows where column a is not NA and the second is all the rows where column b is not NA.
This is my code. It prints the whole dataframe both times, without triggering the filter.
a <- cbind(rep(NA, 100), seq(0,99))
b <- cbind(seq(0,99), rep(NA, 100))
df <- as.data.frame(rbind(a,b))
names(df) <- c("a", "b")
columns <- c("a", "b")
for (j in columns){
df %>% filter(!is.na(j)) %>% print()
}
I also tried with filter(j != "") and received the same result.

As to why the downvote, I cannot know, but I can guess. You used functions that are not base R without issuing library calls for the packages that contain them, and you constructed your example dataframe in a wasteful and possibly dangerous fashion using cbind unnecessarily and as.data.frame where a single data.frame call would have been more efficient, safer and more expressive.
cbind(as.Date("1970-01-01")) # causes loss of attributes including class
# [,1]
#[1,] 0
c(factor("a")) # in R versions before 4.1.0 this drops the factor class; newer versions return a factor
#[1] 1
Here's how to properly construct an example like yours:
df <- data.frame( a = c(rep(NA, 100), seq(0,99)) ,
b = c(seq(0,99), rep(NA, 100)))
And you can get a column or object whose name you have in a character vector with get (assuming that there is an appropriate
columns <- c("a", "b")
library(dplyr)
for (j in columns){
df %>% filter(!is.na( get(j) )) %>% print()
}
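The reason the original loop filters nothing is that j holds the column name as a string, and a string such as "a" is never NA. A package-free sketch of the same loop, using a shortened version of the example data and [[ ]] to look the column up by name, might be:

```r
df <- data.frame(a = c(rep(NA, 3), 0:2),
                 b = c(0:2, rep(NA, 3)))
columns <- c("a", "b")

is.na("a")  # FALSE: this tests the name itself, not the column

filtered <- list()
for (j in columns) {
  filtered[[j]] <- df[!is.na(df[[j]]), ]  # df[[j]] fetches the column named by j
  print(filtered[[j]])
}
```

Within dplyr, filter(!is.na(.data[[j]])) should be an alternative to get(j) that only ever looks inside the data frame.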

Do you mean something like:
not_na_a <- data.frame(which(!is.na(df$a)))
#> head(not_na_a)
which..is.na.df.a..
1 101
2 102
3 103
4 104
5 105
6 106
not_na_b <- data.frame(which(!is.na(df$b)))
#> head(not_na_b)
which..is.na.df.b..
1 1
2 2
3 3
4 4
5 5
6 6

How to calculate common values across different groups?

I am trying to create a data frame for creating network charts using igraph package. I have sample data "mydata_data" and I want to create "expected_data".
I can easily calculate the number of customers who visited a particular store, but how do I calculate the common set of customers who go to both store x1 and store x2, etc.?
I have 500+ stores, so I don't want to create columns manually. Sample data for reproducible purpose given below:
mydata_data<-data.frame(
Customer_Name=c("A","A","C","D","D","B"),
Store_Name=c("x1","x2","x2","x2","x3","x1"))
expected_data<-data.frame(
Store_Name=c("x1","x2","x3","x1_x2","x2_x3","x1_x3"),
Customers_Visited=c(2,3,1,1,1,0))

Another possible solution via dplyr is to create a list with all the combos for each customer, unnest that list, count and merge with a data frame with all the combinations, i.e.
library(tidyverse)
mydata_data %>%
group_by(Customer_Name) %>%
summarise(combos = list(unique(c(unique(Store_Name), paste(unique(Store_Name), collapse = '_'))))) %>%
unnest(combos) %>%
group_by(combos) %>%
count() %>%
right_join(data.frame(combos = c(unique(mydata_data$Store_Name), combn(unique(mydata_data$Store_Name), 2, paste, collapse = '_'))))
which gives,
# A tibble: 6 x 2
# Groups: combos [?]
combos n
<chr> <int>
1 x1 2
2 x2 3
3 x3 1
4 x1_x2 1
5 x1_x3 NA
6 x2_x3 1
NOTE: Make sure that your Store_Name variable is a character NOT factor, otherwise the combn() will fail

Here's an igraph approach:
library(igraph)
A <- as.matrix(as_adj(graph_from_edgelist(as.matrix(mydata_data), directed = FALSE)))
stores <- as.character(unique(mydata_data$Store_Name))
storeCombs <- t(combn(stores, 2))
data.frame(Store_Name = c(stores, apply(storeCombs, 1, paste, collapse = "_")),
Customers_Visited = c(colSums(A)[stores], (A %*% A)[storeCombs]))
# Store_Name Customers_Visited
# 1 x1 2
# 2 x2 3
# 3 x3 1
# 4 x1_x2 1
# 5 x1_x3 0
# 6 x2_x3 1
Explanation: A is the adjacency matrix of the corresponding undirected graph. stores is simply
stores
# [1] "x1" "x2" "x3"
while
storeCombs
# [,1] [,2]
# [1,] "x1" "x2"
# [2,] "x1" "x3"
# [3,] "x2" "x3"
The main trick then is how to obtain Customers_Visited: the first three numbers are just the corresponding numbers of neighbours of stores, while the common customers we get from the common graph neighbours (which we get from the square of A).
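The same "square of a matrix" idea can be sketched without igraph by working with the customer-by-store incidence matrix directly. This is an alternative formulation and not the answer's own code; the names incidence, common, pairs and result are illustrative. Multiplying the transposed incidence matrix by itself gives, for every pair of stores, the number of customers who visited both, with the per-store customer counts on the diagonal.

```r
mydata_data <- data.frame(
  Customer_Name = c("A", "A", "C", "D", "D", "B"),
  Store_Name    = c("x1", "x2", "x2", "x2", "x3", "x1"),
  stringsAsFactors = FALSE)

# 1 if the customer (row) visited the store (column), 0 otherwise
incidence <- 1 * (unclass(table(mydata_data$Customer_Name,
                                mydata_data$Store_Name)) > 0)

# store-by-store matrix: t(incidence) %*% incidence
common <- crossprod(incidence)

stores <- colnames(common)
pairs  <- t(combn(stores, 2))  # one row per pair of stores

result <- data.frame(
  Store_Name        = c(stores, paste(pairs[, 1], pairs[, 2], sep = "_")),
  Customers_Visited = unname(c(diag(common), common[pairs])))
result
```

With 500+ stores the pair list grows to well over 100,000 rows, so it may be preferable to keep the matrix common as it is and only extract the pairs that are actually needed.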

Here's one possible way to get the data
Here's a helper function adapted from here: Generate all combinations, of all lengths, in R, from a vector
comball <- function(x) do.call("c", lapply(seq_along(x), function(i) combn(as.character(x), i, FUN = list)))
Then you can use that with some tidyverse functions:
library(dplyr)
library(purrr)
library(tidyr)
mydata_data %>%
group_by(Customer_Name) %>%
summarize(visits = list(comball(Store_Name))) %>%
mutate(visits = map(visits, ~map_chr(., ~paste(., collapse="_")))) %>%
unnest(visits) %>%
count(visits)

Another option, with base R:
Get the list of all possible stores
all_stores <- as.character(unique(mydata_data$Store_Name))
Find the different combinations of 1 or 2 stores :
all_comb_store <- lapply(1:2, function(n) combn(all_stores, n))
For each number of stores combined, get the number of customers that visited all of them, and then combine this value in a data.frame with the names of the stores:
do.call(rbind,
lapply(all_comb_store,
function(nb_comb) {
data.frame(Store_Name=if (nrow(nb_comb)==1) as.character(nb_comb) else apply(nb_comb, 2, paste, collapse="_"),
Customers_Visited=apply(nb_comb, 2,
function(vec_stores) {
length(Reduce(intersect,
lapply(vec_stores,
function(store) mydata_data$Customer_Name[mydata_data$Store_Name %in% store])))}))}))
# Store_Name Customers_Visited
#1 x1 2
#2 x2 3
#3 x3 1
#4 x1_x2 1
#5 x1_x3 0
#6 x2_x3 1

Using dplyr: self join, then make group and get unique count. This should be a lot quicker compared to other answers where all combinations are considered.
Note: it doesn't show non-existent pairs. Also, here x1_x1 means, of course, x1.
library(dplyr)
left_join(mydata_data, mydata_data, by = "Customer_Name") %>%
transmute(Customer_Name,
grp = paste(pmin(Store_Name.x, Store_Name.y),
pmax(Store_Name.x, Store_Name.y), sep = "_")) %>%
group_by(grp) %>%
summarise(n = n_distinct(Customer_Name))
# # A tibble: 5 x 2
# grp n
# <chr> <int>
# 1 x1_x1 2
# 2 x1_x2 1
# 3 x2_x2 3
# 4 x2_x3 1
# 5 x3_x3 1
Data without factors:
mydata_data<-data.frame(
Customer_Name=c("A","A","C","D","D","B"),
Store_Name=c("x1","x2","x2","x2","x3","x1"),
stringsAsFactors = FALSE)

R regular expression for p#q#c#

What would the regular expression be to encompass variable names such as p3q10000c150 and p29q2990c98? I want to add all variables in the format of p-any number-q-any number-c-any number to a list in R.
Thanks!

I think you are looking for something like matches function in dplyr::select:
df = data.frame(1:10, 1:10, 1:10, 1:10)
names(df) = c("p3q10000c150", "V1", "p29q2990c98", "V2")
library(dplyr)
df %>%
select(matches("^p\\d+q\\d+c\\d+$"))
Result:
p3q10000c150 p29q2990c98
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
matches in select allows you to use regex to extract variables.

If your objective is to pull out the 3 numbers and put them in a 3 column data frame or matrix then any of these alternatives would do it.
The regular expression in #1 matches p and then one or more digits and then q and then one or more digits and then c and one or more digits. The parentheses form capture groups which are placed in the corresponding columns of the prototype data frame given as the third argument.
In #2 each non-digit ("\\D") is replaced with a space and then read.table reads in the data using the indicated column names.
In #3 we convert each element of the input to DCF format, namely c("\np: 3\nq: 10000\nc: 150", "\np: 29\nq: 2990\nc: 98"), and then read it in using read.dcf and convert the columns to numeric. This creates a matrix whereas the prior two alternatives create data frames.
The second alternative seems simplest but the third one is more general in that it does not hard code the header names or the number of columns. (If we used col.names = strsplit(input, "\\d+")[[1]] in #2 then it would be similarly general.)
# 1
strcapture("p(\\d+)q(\\d+)c(\\d+)", input,
data.frame(p = character(), q = character(), c = character()))
# 2
read.table(text = gsub("\\D", " ", input), col.names = c("p", "q", "c"))
# 3
apply(read.dcf(textConnection(gsub("(\\D)", "\n\\1: ", input))), 2, as.numeric)
The first two above give this data.frame and the third one gives the corresponding numeric matrix.
p q c
1 3 10000 150
2 29 2990 98
Note: The input is assumed to be:
input <- c("p3q10000c150", "p29q2990c98")
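A further base R variant, given only as a sketch, pulls out every run of digits with regmatches and takes the column names from the non-digit parts of the first element, in the spirit of the strsplit remark above. It assumes all elements share the same letters in the same order; nums, mat and the other names are illustrative.

```r
input <- c("p3q10000c150", "p29q2990c98")

# every run of digits in each element, as a list of character vectors
nums <- regmatches(input, gregexpr("[0-9]+", input))

# one row per element, converted to numeric
mat <- do.call(rbind, lapply(nums, as.numeric))

# header names are whatever separates the numbers in the first element
colnames(mat) <- strsplit(input[1], "[0-9]+")[[1]]
mat
```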

Try:
x <- c("p3q10000c150", "p29q2990c98")
sapply(strsplit(x, "[pqc]"), function(i){
setNames(as.numeric(i[-1]), c("p", "q", "c"))
})
# [,1] [,2]
# p 3 29
# q 10000 2990
# c 150 98

I'll assume you have a data frame called df with variable names names(df). If you want to retain only the variables with the structure p<somenumbers>q<somenumbers>c<somenumbers>, you could use the regex that Wiktor Stribiżew suggested in the comments like this:
valid_vars <- grepl("p\\d+q\\d+c\\d", names(df))
df2 <- df[, valid_vars]
grepl() will return a vector of TRUE and FALSE values, indicating which element in names(df) follows the structure you suggested. Afterwards you use the output of grepl() to subset your data frame.
For clarity, observe:
var_names_test <- c("p3q10000c150", "p29q2990c98", "var1")
grepl("p\\d+q\\d+c\\d", var_names_test)
# [1] TRUE TRUE FALSE
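Note that this pattern is not anchored, so it also accepts names that merely contain such a sequence. If only exact matches are wanted, a stricter variant with ^, $ and a trailing + can be compared against it. The sketch below uses a made-up fourth name, "xp1q2c3_old", purely to show the difference.

```r
var_names <- c("p3q10000c150", "p29q2990c98", "var1", "xp1q2c3_old")

loose  <- grepl("p\\d+q\\d+c\\d", var_names)     # matches anywhere in the name
strict <- grepl("^p\\d+q\\d+c\\d+$", var_names)  # the whole name must match

loose
strict
var_names[strict]
```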
