Extract data after each appearance of a string in R

I have data like this (named spectra):
#Milk spectra: 1234
##XYDATA=(X++(Y..Y))
649.025085449219
667.675231457819
686.325377466418
##XYDATA=(X++(Y..Y))
723.625669483618
742.275815492218
760.925961500818
##XYDATA=(X++(Y..Y))
872.826837552417
891.476983561017
910.127129569617
928.777275578216
In this data, each occurrence of the string ##XYDATA=(X++(Y..Y)) marks the start of the data for a different animal.
So I want code that extracts this sample into 3 separate pieces of data:
Animal 1: the 3 lines after the 1st '##XYDATA=(X++(Y..Y))'
Animal 2: the 3 lines after the 2nd '##XYDATA=(X++(Y..Y))'
And so on.
I tried the line of code below, but it only extracts the first line after each appearance of the string '##XYDATA=(X++(Y..Y))', all lumped together. So it does not do what I expect: I want three lines per appearance, kept as a separate piece of data for each appearance.
bo<-data.frame(spectra$V1[which(spectra$V1 == '##XYDATA=(X++(Y..Y))')+1])

Okay, I think you could do something along these lines. I'm sure this could be done better and more efficiently, but the idea is to read the data in as a character vector and then loop through it to spread it out. However, this assumes there is always the same number of measurements per animal and that you have a way to identify the marker values.
c_data <- c("split", 1, 2, 3,
            "split", 4, 5, 6)
# TRUE wherever a marker starts a new animal
y <- c_data == "split"
df_wide <- data.frame("animal" = character(), "v1" = numeric(), "v2" = numeric(), "v3" = numeric(),
                      stringsAsFactors = FALSE)
x <- 0
for (i in seq_along(c_data)) {
  if (y[i]) {
    x <- x + 1
    # the marker plus the three values after it become one row
    df_wide[x,] <- rbind(c(c_data[i], c_data[i+1], c_data[i+2], c_data[i+3]))
  }
}
yields
animal v1 v2 v3
1 split 1 2 3
2 split 4 5 6
If it is a one-time thing, it may not be worth trying to write something nicer. If it is an ongoing thing, you may want to look at using an apply function, for which you would probably have to write your own function.
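As a rough base R sketch of the same idea without a fixed number of measurements per animal: number the marker lines with cumsum() and split the remaining values by that number. The names lines, marker and by_animal are made up for this illustration, and the input is assumed to already be a character vector (for example spectra$V1 read as character).

```r
# Hypothetical input mirroring the sample in the question
lines <- c("#Milk spectra: 1234",
           "##XYDATA=(X++(Y..Y))",
           "649.025085449219", "667.675231457819", "686.325377466418",
           "##XYDATA=(X++(Y..Y))",
           "723.625669483618", "742.275815492218", "760.925961500818",
           "##XYDATA=(X++(Y..Y))",
           "872.826837552417", "891.476983561017", "910.127129569617",
           "928.777275578216")
marker <- "##XYDATA=(X++(Y..Y))"

# TRUE on marker lines; the running count gives an animal number to every line
is_marker <- lines == marker
animal <- cumsum(is_marker)

# Drop the header (animal 0) and the marker lines themselves
keep <- animal > 0 & !is_marker

# One numeric vector per animal, named "1", "2", "3"
by_animal <- split(as.numeric(lines[keep]), animal[keep])
```

Because the grouping comes from the marker positions, animals with a different number of lines (such as the third one here) need no special handling.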

You can do either of the following with split and map:
library(dplyr)
library(purrr)
df %>%
mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
split(.$Animal) %>%
map(~slice(., -1) %>% mutate(V1 = as.numeric(V1))) %>%
'['(-1)
This creates an indicator variable Animal, splits the data by that indicator, removes the first row of each dataframe, converts V1 to numeric, and finally removes the first element of the list.
You can also do the following:
df %>%
mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
filter(!grepl("^#.*$", V1)) %>%
mutate(V1 = as.numeric(V1)) %>%
split(.$Animal)
This also creates the indicator Animal, but instead filters out all rows containing a # sign and converts V1 to numeric before splitting into separate dataframes.
Result:
$`1`
# A tibble: 3 x 2
V1 Animal
<dbl> <int>
1 649.0251 1
2 667.6752 1
3 686.3254 1
$`2`
# A tibble: 3 x 2
V1 Animal
<dbl> <int>
1 723.6257 2
2 742.2758 2
3 760.9260 2
$`3`
# A tibble: 4 x 2
V1 Animal
<dbl> <int>
1 872.8268 3
2 891.4770 3
3 910.1271 3
4 928.7773 3
Note:
Here I assumed #Milk spectra: 1234 is also a row in your column, hence the subsetting at the end.
Data:
df = read.table(textConnection("'#Milk spectra: 1234'
##XYDATA=(X++(Y..Y))
649.025085449219
667.675231457819
686.325377466418
##XYDATA=(X++(Y..Y))
723.625669483618
742.275815492218
760.925961500818
##XYDATA=(X++(Y..Y))
872.826837552417
891.476983561017
910.127129569617
928.777275578216"),comment.char = "", stringsAsFactors = FALSE)

Related

Why does stringr::str_match on a column return a matrix?

I'm using tidyverse to load the data, so I have a tibble which you can reproduce like:
df_1 <- tibble(id = c(1, 2, 3), subject_id = c("ABCD-FOO1-G001-YX-732E5", "ABCD-FOO2-A011-ZA-892N2", "ABCD-FOO3-1001-CD-742W5"))
Now I want to modify subject_id to extract just the two first character groups, i.e:
"ABCD-FOO1-G001-YX-732E5" -> "ABCD-FOO1"
When I'm running the following code:
df_1 %>% mutate(subject_id = stringr::str_match(subject_id, "[^-]*-[^-]*"))
each element of the subject_id column is a tibble itself:
> class(df_1[1, "subject_id"])
[1] "tbl_df" "tbl" "data.frame"
How do I make sure subject_id is a character vector instead of tibble?
We can use str_extract
library(stringr)
library(dplyr)
df_1 %>%
mutate(subject_id = str_extract(subject_id, "^\\w+-\\w+"))
# A tibble: 3 x 2
# id subject_id
# <dbl> <chr>
#1 1 ABCD-FOO1
#2 2 ABCD-FOO2
#3 3 ABCD-FOO3
Here is a take on how to avoid this rather than why it happens.
As we learn from ?str_match:
For str_match, a character matrix. First column is the complete match, followed by one column for each capture group. [...]
So we need to pull the first column from the matrix:
df_1 %>% mutate(subject_id = stringr::str_match(subject_id, "[^-]*-[^-]*") %>% .[,1])
# # A tibble: 3 x 2
# id subject_id
# <dbl> <chr>
# 1 1 ABCD-FOO1
# 2 2 ABCD-FOO2
# 3 3 ABCD-FOO3
Also keep in mind that in your example with class(), you subset a tibble. A tibble always stays a tibble, even if it has only one cell. For comparison, see class(df_1[1,"id"]). For more on that, have a look at this chapter from R for Data Science.
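To make the difference concrete, here is a small sketch comparing the two stringr functions on a plain character vector. The vector ids is invented for the illustration; the pattern is the one from the question.

```r
library(stringr)

# Hypothetical input: two of the subject ids from the question
ids <- c("ABCD-FOO1-G001-YX-732E5", "ABCD-FOO2-A011-ZA-892N2")

# str_match() returns a character matrix: one row per input,
# first column = full match, further columns = capture groups (none here)
m <- str_match(ids, "[^-]*-[^-]*")

# str_extract() returns a plain character vector holding the full match only
v <- str_extract(ids, "[^-]*-[^-]*")

is.matrix(m)         # matrix, so mutate() stores it as a matrix column
is.null(dim(v))      # ordinary vector, no dimensions
identical(m[, 1], v) # first matrix column equals the str_extract() result
```

So when no capture groups are needed, str_extract() avoids the extra subsetting step.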

How to calculate common values across different groups?

I am trying to create a data frame for creating network charts using igraph package. I have sample data "mydata_data" and I want to create "expected_data".
I can easily calculate the number of customers who visited a particular store, but how do I calculate the common set of customers who go to both store x1 and store x2, and so on?
I have 500+ stores, so I don't want to create columns manually. Sample data for reproducible purpose given below:
mydata_data<-data.frame(
Customer_Name=c("A","A","C","D","D","B"),
Store_Name=c("x1","x2","x2","x2","x3","x1"))
expected_data<-data.frame(
Store_Name=c("x1","x2","x3","x1_x2","x2_x3","x1_x3"),
Customers_Visited=c(2,3,1,1,1,0))
Another possible solution via dplyr is to create a list with all the combos for each customer, unnest that list, count and merge with a data frame with all the combinations, i.e.
library(tidyverse)
mydata_data %>%
  group_by(Customer_Name) %>%
  summarise(combos = list(unique(c(unique(Store_Name), paste(unique(Store_Name), collapse = '_'))))) %>%
  unnest(combos) %>%
  group_by(combos) %>%
  count() %>%
  right_join(data.frame(combos = c(unique(mydata_data$Store_Name), combn(unique(mydata_data$Store_Name), 2, paste, collapse = '_'))))
which gives,
# A tibble: 6 x 2
# Groups: combos [?]
combos n
<chr> <int>
1 x1 2
2 x2 3
3 x3 1
4 x1_x2 1
5 x1_x3 NA
6 x2_x3 1
NOTE: Make sure that your Store_Name variable is a character NOT factor, otherwise the combn() will fail
Here's an igraph approach:
library(igraph)
A <- as.matrix(as_adj(graph_from_edgelist(as.matrix(mydata_data), directed = FALSE)))
stores <- as.character(unique(mydata_data$Store_Name))
storeCombs <- t(combn(stores, 2))
data.frame(Store_Name = c(stores, apply(storeCombs, 1, paste, collapse = "_")),
Customers_Visited = c(colSums(A)[stores], (A %*% A)[storeCombs]))
# Store_Name Customers_Visited
# 1 x1 2
# 2 x2 3
# 3 x3 1
# 4 x1_x2 1
# 5 x1_x3 0
# 6 x2_x3 1
Explanation: A is the adjacency matrix of the corresponding undirected graph. stores is simply
stores
# [1] "x1" "x2" "x3"
while
storeCombs
# [,1] [,2]
# [1,] "x1" "x2"
# [2,] "x1" "x3"
# [3,] "x2" "x3"
The main trick then is how to obtain Customers_Visited: the first three numbers are just the corresponding numbers of neighbours of stores, while the common customers we get from the common graph neighbours (which we get from the square of A).
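The same counting idea can be sketched in base R without building a graph. This is not the igraph code above but a swapped-in variant: it builds a customer-by-store incidence matrix and multiplies its transpose by itself, so the diagonal holds the customers per store and the off-diagonal entries hold the shared customers. The names visits, inc, co and store_counts are made up for this example.

```r
mydata_data <- data.frame(
  Customer_Name = c("A", "A", "C", "D", "D", "B"),
  Store_Name = c("x1", "x2", "x2", "x2", "x3", "x1"),
  stringsAsFactors = FALSE)

# Customer-by-store incidence matrix (1 = visited);
# unique() guards against a customer being listed twice for one store
visits <- unique(mydata_data)
inc <- unclass(table(visits$Customer_Name, visits$Store_Name))

# t(inc) %*% inc: diagonal = customers per store,
# off-diagonal = customers shared by a pair of stores
co <- crossprod(inc)

stores <- colnames(co)
pairs <- t(combn(stores, 2))  # one row per pair of stores

store_counts <- data.frame(
  Store_Name = c(stores, paste(pairs[, 1], pairs[, 2], sep = "_")),
  Customers_Visited = c(unname(diag(co)), co[pairs]),
  stringsAsFactors = FALSE)
```

Indexing co with the two-column character matrix pairs picks one cell per row of pairs, which is the same trick the igraph answer uses on A %*% A.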
Here's one possible way to get the data
Here's a helper function adapted from here: Generate all combinations, of all lengths, in R, from a vector
comball <- function(x) do.call("c", lapply(seq_along(x), function(i) combn(as.character(x), i, FUN = list)))
Then you can use that with some tidyverse functions
library(dplyr)
library(purrr)
library(tidyr)
mydata_data %>%
group_by(Customer_Name) %>%
summarize(visits = list(comball(Store_Name))) %>%
mutate(visits = map(visits, ~map_chr(., ~paste(., collapse="_")))) %>%
unnest(visits) %>%
count(visits)
Another option, with base R:
Get the list of all possible stores
all_stores <- as.character(unique(mydata_data$Store_Name))
Find the different combinations of 1 or 2 stores :
all_comb_store <- lapply(1:2, function(n) combn(all_stores, n))
For each number of stores combined, get the number of customers that visited all of them, and then combine this value in a data.frame with the names of the stores:
do.call(rbind,
lapply(all_comb_store,
function(nb_comb) {
data.frame(Store_Name=if (nrow(nb_comb)==1) as.character(nb_comb) else apply(nb_comb, 2, paste, collapse="_"),
Customers_Visited=apply(nb_comb, 2,
function(vec_stores) {
length(Reduce(intersect,
lapply(vec_stores,
function(store) mydata_data$Customer_Name[mydata_data$Store_Name %in% store])))}))}))
# Store_Name Customers_Visited
#1 x1 2
#2 x2 3
#3 x3 1
#4 x1_x2 1
#5 x1_x3 0
#6 x2_x3 1
Using dplyr: self join, then make group and get unique count. This should be a lot quicker compared to other answers where all combinations are considered.
Note: it doesn't show non-existent pairs. Also, here x1_x1 means, of course, x1.
left_join(mydata_data, mydata_data, by = "Customer_Name") %>%
transmute(Customer_Name,
grp = paste(pmin(Store_Name.x, Store_Name.y),
pmax(Store_Name.x, Store_Name.y), sep = "_")) %>%
group_by(grp) %>%
summarise(n = n_distinct(Customer_Name))
# # A tibble: 5 x 2
# grp n
# <chr> <int>
# 1 x1_x1 2
# 2 x1_x2 1
# 3 x2_x2 3
# 4 x2_x3 1
# 5 x3_x3 1
Data without factors:
mydata_data<-data.frame(
Customer_Name=c("A","A","C","D","D","B"),
Store_Name=c("x1","x2","x2","x2","x3","x1"),
stringsAsFactors = FALSE)

Command for renaming several variables

After reshaping my data, I have a large dataset with column names that look like this:
1_abc 1_vwxyz 2_abc 2_vwxyz
I would like to change my column names to look like this: abc_1 vwxyz_1 abc_2 vwxyz_2
My code looks like this:
data <- tibble("1_abc" = c(1,2,3), "1_vwxyz" = c(10,11,12),
"2_abc" = c(1,1,2),"2_vwxyz" = c(9,11,15))
data_renamed <- data %>%
rename_(.dots=setNames(names(.), paste(substr(names(.), start=3, stop=nchar(names(.))),
substr(names(.), start=1, stop=1))))
I get this error:
Error in parse(text = x) : <text>:1:2: unexpected input
1: 1_
^
Here's a solution in base R. You first take the column names as a character vector, convert them to a list of two-element character vectors, reverse the order of each and put them back together with _.
ll <- strsplit(colnames(data), split = "_")
# apply across this list of character vectors to reverse the order and concatenate
ll1 <- lapply(ll, function(x) paste(rev(x), collapse = "_"))
# unlist and assign them to the new data frame
data_renamed <- data
colnames(data_renamed) <- unlist(ll1)
# A tibble: 3 x 4
# abc_1 vwxyz_1 abc_2 vwxyz_2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 10 1 9
# 2 2 11 1 11
# 3 3 12 2 15
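An alternative sketch in base R uses a single regular expression with backreferences instead of splitting and reversing. It assumes every column name has the form digits, underscore, rest; the small data frame below is a stand-in for the question's tibble.

```r
# Stand-in for the question's data; check.names = FALSE keeps names that start with a digit
data <- data.frame("1_abc" = c(1, 2, 3), "1_vwxyz" = c(10, 11, 12),
                   "2_abc" = c(1, 1, 2), "2_vwxyz" = c(9, 11, 15),
                   check.names = FALSE)

# Capture the leading number and the rest, then swap them around the underscore
new_names <- sub("^([0-9]+)_(.*)$", "\\2_\\1", names(data))

data_renamed <- data
names(data_renamed) <- new_names
```

Names that do not match the pattern are left unchanged by sub(), and a prefix with more than one digit (for example 10_abc) is handled as well.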

unquote string as variable in pipe

I want to remove duplicate rows from a dataframe, for specific columns only. That can be obtained with distinct:
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5,4,5,5))
filtered_data <- data %>% distinct(a, b, .keep_all = T)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. My problem is that the column names I need to use with distinct will change. So I have a character vector gen that contains the names of the columns I want to use with the distinct function. They need to be unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This, however, gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = T)
dim(filtered_data)
# [1] 2 4
The eval seems to do something funny with the amount of times the data is filtered. (and, adds an extra column. I could live with that, though...) So, how to obtain a similar result, as if I had used a,b, but by using a variable instead?
additional information
I actually obtain gen by reading the column names of a dataframe: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect if I had a way to transform gen to c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquote(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, with the same sort of repeated filtering as I started with...
fixed
I think I got it working. It might be inelegant, and I'm not sure if every step is necessary for the result, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop while adding to a list in GlobalEnv (edit: GlobalEnv isn't necessary):
unquote_string <- function(string) {
out <- list()
i <- 1
for (s in string) {
t <- ensym(s)
out[i] <-dplyr::quos(!!t)
i <- i+1
}
return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
How about creating a function and using quosures? Perhaps something like this is what you are looking for...
rm_dup_rows <- function(data, ...){
vars = dplyr::quos(...)
data %>% distinct(!!! vars, .keep_all = T)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
2 3 5
2 4 5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 4 5
Additional
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this...
rm_dup_rows <- function(data, vars){
data %>% distinct(!!! vars, .keep_all = T)
}
# quos your column name vector
gen <- quos(a,z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 3 5
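When the column names arrive as a character vector, as in the question, one possible shortcut is to turn the strings into symbols with rlang::syms() and splice them with !!!. This is a sketch under that assumption, not the approach used in the answer above; the data is the question's example, built as a plain data frame.

```r
library(dplyr)

data <- data.frame(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5, 4, 5, 5))

# Column names as strings, e.g. read from colnames()
gen <- colnames(data)[1:2]

# syms() converts each string to a symbol; !!! splices them in as if a, b had been typed
filtered_data <- data %>%
  distinct(!!!rlang::syms(gen), .keep_all = TRUE)

dim(filtered_data)
```

This gives the same rows as distinct(a, b, .keep_all = TRUE), without hardcoding the column names.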

Sorting Interior of a column in R

I am trying to sort the interior of a column in R. For example I have this:
ID HoursAvailable
1 a,b,c,k,d
2 e,g,h
3 a,b,c,h,d
And I am trying to sort the values within each cell of the column, like this:
ID HoursAvailable
1 a,b,c,d,k
2 e,g,h,,
3 a,b,c,d,h
I have tried to use the separate function like this:
cdMCd<- cdMf %>% separate(HoursAvailable, c("a","b","c","d","e","f","g","h","i","j"))
But I cannot get it to sort correctly. For this example e in ID 2 would be sorted into the a column, but I need it sorted into the e column. I was planning to separate all the hours into separate columns, order, then recombine, but I cannot get them to separate correctly.
library(dplyr)
dt = read.table(text="
ID HoursAvailable
1 a,b,c,k,d
2 e,g,h
3 a,b,c,h,d
", header=T, stringsAsFactors=F)
SortString = function(x) {paste0(sort(unlist(strsplit(x, split=","))),collapse = ",")}
dt %>%
rowwise() %>%
mutate(Updated = SortString(HoursAvailable)) %>%
ungroup()
# # A tibble: 3 x 3
# ID HoursAvailable Updated
# <int> <chr> <chr>
# 1 1 a,b,c,k,d a,b,c,d,k
# 2 2 e,g,h e,g,h
# 3 3 a,b,c,h,d a,b,c,d,h
Here is what I would do:
First create a function that sorts a single string, and then create a function that applies it to a vector of strings
library(stringr)
library(plyr)
split_and_sort <- function(x){
x_split <- sort(unlist(str_split(x, ",")))
return(paste(x_split, collapse = ","))
}
split_and_sort_column <- function(x){
laply(x, split_and_sort)
}
df$HoursAvailable <- split_and_sort_column(df$HoursAvailable)
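The same split, sort and paste steps can also be sketched in base R alone, with vapply() taking the place of laply(). The function name sort_items is invented for this example, and the data frame reproduces the question's sample.

```r
dt <- data.frame(ID = 1:3,
                 HoursAvailable = c("a,b,c,k,d", "e,g,h", "a,b,c,h,d"),
                 stringsAsFactors = FALSE)

# Split one string on commas, sort the pieces, and glue them back together
sort_items <- function(x) {
  paste(sort(strsplit(x, ",", fixed = TRUE)[[1]]), collapse = ",")
}

# vapply() applies the helper to every cell and guarantees a character result
dt$Updated <- vapply(dt$HoursAvailable, sort_items, character(1), USE.NAMES = FALSE)
```

Note that sort() orders strings according to the current locale, which makes no difference for single lowercase letters like these.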

Resources