I am trying to sort the interior of a column in R. For example I have this:
ID HoursAvailable
1 a,b,c,k,d
2 e,g,h
3 a,b,c,h,d
And I am trying to sort the values within each cell of the column, like this:
ID HoursAvailable
1 a,b,c,d,k
2 e,g,h,,
3 a,b,c,d,h
I have tried to use the separate function like this:
cdMCd<- cdMf %>% separate(HoursAvailable, c("a","b","c","d","e","f","g","h","i","j"))
But I cannot get it to sort correctly. For this example e in ID 2 would be sorted into the a column, but I need it sorted into the e column. I was planning to separate all the hours into separate columns, order, then recombine, but I cannot get them to separate correctly.
library(dplyr)
dt = read.table(text="
ID HoursAvailable
1 a,b,c,k,d
2 e,g,h
3 a,b,c,h,d
", header=T, stringsAsFactors=F)
SortString = function(x) {paste0(sort(unlist(strsplit(x, split=","))),collapse = ",")}
dt %>%
rowwise() %>%
mutate(Updated = SortString(HoursAvailable)) %>%
ungroup()
# # A tibble: 3 x 3
# ID HoursAvailable Updated
# <int> <chr> <chr>
# 1 1 a,b,c,k,d a,b,c,d,k
# 2 2 e,g,h e,g,h
# 3 3 a,b,c,h,d a,b,c,d,h
Here is what I will do:
First create a function that sorts a single comma-separated string, then create a second function that applies it to every element of a character vector:
library(stringr)
library(plyr)
split_and_sort <- function(x){
x_split <- sort(unlist(str_split(x, ",")))
return(paste(x_split, collapse = ","))
}
split_and_sort_column <- function(x){
laply(x, split_and_sort)
}
df$HoursAvailable <- split_and_sort_column(df$HoursAvailable)
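For comparison, the same split, sort and paste idea can be written in base R without any packages. This is only a minimal sketch: the data frame `df` below is a stand-in that mirrors the question's example, and `sort_tokens` is a hypothetical helper name, not something from the answers above.

```r
# Stand-in data mirroring the question's example (assumption).
df <- data.frame(ID = 1:3,
                 HoursAvailable = c("a,b,c,k,d", "e,g,h", "a,b,c,h,d"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: sort the comma-separated tokens inside each string.
sort_tokens <- function(x) {
  vapply(strsplit(x, ",", fixed = TRUE),
         function(tokens) paste(sort(tokens), collapse = ","),
         character(1))
}

df$HoursAvailable <- sort_tokens(df$HoursAvailable)
df
```

Because `vapply` declares that each result is a single string, it fails loudly if an element ever produces something else, which `sapply` would not.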
I am currently trying to run multiple data using loop, and ended up with the results below. Each line corresponds to the output from one data that has been filtered.
I am using this code to get the results below.
output <- print(paste(data.final$Peptide, collapse = ','))
It was previously in the form of a table with a column named "Peptide", so I am pasting the peptides into a single comma-separated string, as shown here:
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
I would like to count how many times each comma-separated string (e.g. LPPAYTNSF) is duplicated across the lines.
Is there any way to do this?
If I understand you correctly, there is no reason to first collapse them to a string. Why not just get the counts from your data.frame? Any peptide with n > 1 is duplicated.
data.final %>% count(Peptide)
solution based on your string
table(unlist(strsplit(v, ",")))
results
ASFSTFKCY CVADYSVLY KIYSKHTPI LPFNDGVYF LPPAYTNSF LPSAYTNSF QSYGFQPTY RLFRKSNLK SANNCTFEY SAYTNSFTR TSNQVAVLY WMESEFRVY
8 8 8 8 6 2 6 8 8 2 8 8
WTAGAAAYY YLQPRTFLL YNSASFSTF YSSANNCTF
8 8 8 8
data
v <- "LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
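If the lines are kept as separate strings rather than pasted into one, the number of lines in which each peptide appears can be counted directly. The following is a hedged base R sketch; `peptide_lines` and its short placeholder values are assumptions made for illustration, not the real data.

```r
# Hypothetical input: one comma-separated string per filtered data set.
peptide_lines <- c("AAA,BBB,CCC",
                   "AAA,CCC",
                   "AAA,DDD")

# Split each line and drop repeats within a line, so that each
# peptide is counted at most once per line.
tokens_per_line <- lapply(strsplit(peptide_lines, ",", fixed = TRUE), unique)

# Number of lines in which each peptide occurs.
line_counts <- table(unlist(tokens_per_line))

# Peptides shared by more than one line.
shared <- line_counts[line_counts > 1]
shared
```

Dropping the `unique` step would instead count every occurrence, which matches the `table(unlist(strsplit(v, ",")))` approach above.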
Not quite sure what your data are like so I've scribbled together some toy data to show how a regex and a non-regex tidyverse solution would work on your task of "find[ing] the number of duplicates from each comma-separated strings":
Data:
x <- c("a,c,x,a,f,s,w,s,b,n,x,q",
"A,B,B,X,B,Q")
A regex tidyverse solution:
library(tidyverse)
data.frame(x) %>%
# create new column with duplicated values:
mutate(dups = str_extract_all(x, "\\b([A-Za-z]+)\\b(?=.*\\b\\1\\b)"))
x dups
1 a,c,x,a,f,s,w,s,b,n,x,q a, x, s
2 A,B,B,X,B,Q B, B
NB: the + inside the capture group and the \\b word boundaries make sure this solution also works for comma-separated alphabetic strings longer than one character; each match is a whole token that occurs again later in the same string.
Alternatively, a non-regex tidyverse solution:
library(tidyverse)
data.frame(x) %>%
# make column with unique ID for each string:
mutate(stringID = row_number()) %>%
# separate values into rows:
separate_rows(x) %>%
# for each combination of `stringID` and `x`...
group_by(stringID, x) %>%
# ...count the number of tokens:
summarise(N = n()) %>%
# show only the duplicated values:
filter(N > 1)
# A tibble: 4 × 3
# Groups: stringID [2]
stringID x N
<int> <chr> <int>
1 1 a 2
2 1 s 2
3 1 x 2
4 2 B 3
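The same per-string count can also be sketched in base R, using the toy vector `x` from above. The name `dups_per_string` is only an illustrative choice.

```r
# Toy data from the answer above.
x <- c("a,c,x,a,f,s,w,s,b,n,x,q",
       "A,B,B,X,B,Q")

# For each string: tabulate its tokens and keep those seen more than once.
dups_per_string <- lapply(strsplit(x, ",", fixed = TRUE), function(tokens) {
  counts <- table(tokens)
  counts[counts > 1]
})

dups_per_string
```

The result is a list with one table per input string, so the counts stay attached to the string they came from.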
I am trying to create a data frame for building network charts with the igraph package. I have sample data "mydata_data" and I want to create "expected_data".
I can easily calculate the number of customers who visited a particular store, but how do I calculate the number of customers common to store x1 and store x2, and so on?
I have 500+ stores, so I don't want to create the columns manually. Sample data for reproducibility is given below:
mydata_data<-data.frame(
Customer_Name=c("A","A","C","D","D","B"),
Store_Name=c("x1","x2","x2","x2","x3","x1"))
expected_data<-data.frame(
Store_Name=c("x1","x2","x3","x1_x2","x2_x3","x1_x3"),
Customers_Visited=c(2,3,1,1,1,0))
Another possible solution via dplyr is to create a list with all the combos for each customer, unnest that list, count and merge with a data frame with all the combinations, i.e.
library(tidyverse)
df %>%
group_by(Customer_Name) %>%
summarise(combos = list(unique(c(unique(Store_Name), paste(unique(Store_Name), collapse = '_'))))) %>%
unnest(combos) %>%
group_by(combos) %>%
count() %>%
right_join(data.frame(combos = c(unique(df$Store_Name), combn(unique(df$Store_Name), 2, paste, collapse = '_'))))
which gives,
# A tibble: 6 x 2
# Groups: combos [?]
combos n
<chr> <int>
1 x1 2
2 x2 3
3 x3 1
4 x1_x2 1
5 x1_x3 NA
6 x2_x3 1
NOTE: Make sure that your Store_Name variable is a character NOT factor, otherwise the combn() will fail
Here's an igraph approach:
library(igraph)
A <- as.matrix(as_adjacency_matrix(graph_from_edgelist(as.matrix(mydata_data), directed = FALSE)))
stores <- as.character(unique(mydata_data$Store_Name))
storeCombs <- t(combn(stores, 2))
data.frame(Store_Name = c(stores, apply(storeCombs, 1, paste, collapse = "_")),
Customers_Visited = c(colSums(A)[stores], (A %*% A)[storeCombs]))
# Store_Name Customers_Visited
# 1 x1 2
# 2 x2 3
# 3 x3 1
# 4 x1_x2 1
# 5 x1_x3 0
# 6 x2_x3 1
Explanation: A is the adjacency matrix of the corresponding undirected graph. stores is simply
stores
# [1] "x1" "x2" "x3"
while
storeCombs
# [,1] [,2]
# [1,] "x1" "x2"
# [2,] "x1" "x3"
# [3,] "x2" "x3"
The main trick then is how to obtain Customers_Visited: the first three numbers are just the corresponding numbers of neighbours of stores, while the common customers we get from the common graph neighbours (which we get from the square of A).
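The common-neighbour idea can be illustrated without igraph by building a customer-by-store incidence matrix and multiplying it with its own transpose. This is a sketch under the assumption that each row of `mydata_data` is one visit; the names `M` and `co_visits` are made up for the example.

```r
mydata_data <- data.frame(
  Customer_Name = c("A", "A", "C", "D", "D", "B"),
  Store_Name    = c("x1", "x2", "x2", "x2", "x3", "x1"),
  stringsAsFactors = FALSE)

# Customer-by-store incidence matrix: 1 if the customer visited the store.
M <- unclass(table(mydata_data$Customer_Name, mydata_data$Store_Name)) > 0
M <- M * 1

# t(M) %*% M: entry [i, j] is the number of customers who visited both
# store i and store j; the diagonal is the number of customers per store.
co_visits <- crossprod(M)
co_visits
```

Each off-diagonal entry counts paths of length two between two stores through a shared customer, which is what the square of the adjacency matrix provides in the igraph answer.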
Here's one possible way to get the data.
Here's a helper function adapted from the question "Generate all combinations, of all lengths, in R, from a vector":
comball <- function(x) do.call("c", lapply(seq_along(x), function(i) combn(as.character(x), i, FUN = list)))
Then you can use that with some tidyverse functions:
library(dplyr)
library(purrr)
library(tidyr)
mydata_data %>%
group_by(Customer_Name) %>%
summarize(visits = list(comball(Store_Name))) %>%
mutate(visits = map(visits, ~map_chr(., ~paste(., collapse="_")))) %>%
unnest(visits) %>%
count(visits)
Another option, with base R:
Get the list of all possible stores
all_stores <- as.character(unique(mydata_data$Store_Name))
Find the different combinations of 1 or 2 stores :
all_comb_store <- lapply(1:2, function(n) combn(all_stores, n))
For each number of stores combined, get the number of customers that visited all of them, and then combine these values in a data.frame with the names of the stores:
do.call(rbind,
lapply(all_comb_store,
function(nb_comb) {
data.frame(Store_Name=if (nrow(nb_comb)==1) as.character(nb_comb) else apply(nb_comb, 2, paste, collapse="_"),
Customers_Visited=apply(nb_comb, 2,
function(vec_stores) {
length(Reduce(intersect,
lapply(vec_stores,
function(store) mydata_data$Customer_Name[mydata_data$Store_Name %in% store])))}))}))
# Store_Name Customers_Visited
#1 x1 2
#2 x2 3
#3 x3 1
#4 x1_x2 1
#5 x1_x3 0
#6 x2_x3 1
Using dplyr: self join, then make group and get unique count. This should be a lot quicker compared to other answers where all combinations are considered.
Note: it doesn't show non-existent pairs. Also, here x1_x1 means, of course, x1.
left_join(mydata_data, mydata_data, by = "Customer_Name") %>%
transmute(Customer_Name,
grp = paste(pmin(Store_Name.x, Store_Name.y),
pmax(Store_Name.x, Store_Name.y), sep = "_")) %>%
group_by(grp) %>%
summarise(n = n_distinct(Customer_Name))
# # A tibble: 5 x 2
# grp n
# <chr> <int>
# 1 x1_x1 2
# 2 x1_x2 1
# 3 x2_x2 3
# 4 x2_x3 1
# 5 x3_x3 1
Data without factors:
mydata_data<-data.frame(
Customer_Name=c("A","A","C","D","D","B"),
Store_Name=c("x1","x2","x2","x2","x3","x1"),
stringsAsFactors = FALSE)
I have a loop, that creates a tibble at the end of each iteration, tbl. Loop uses different date each time, date.
Assume:
tbl <- tibble(colA=1:5,colB=5:9)
date <- as.Date("2017-02-28")
> tbl
# A tibble: 5 x 2
colA colB
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
(contents are changing every loop, but tbl, date and all columns (colA, colB) names remain the same)
The output that I want needs to start with output - outputdate1, outputdate2 etc.
With columns inside it as colAdate1, colBdate1, and colAdate2, colBdate2 and so on.
At the moment I am using this piece of code, which works, but is not easy to read:
eval(parse(text = (
paste0("output", year(date), months(date), " <- tbl %>% rename(colA", year(date), months(date), " = 'colA', colB", year(date), months(date), " = 'colB')")
)))
It produces this code for eval(parse(...)) to evaluate:
"output2017February <- tbl %>% rename(colA2017February = 'colA', colB2017February = 'colB')"
Which gives me the output that I want:
> output2017February
# A tibble: 5 x 2
colA2017February colB2017February
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
Is there a better way of doing this? (Preferably with dplyr)
Thanks!
This avoids eval and is easier to read:
ym <- "2017February"
assign(paste0("output", ym), setNames(tbl, paste0(names(tbl), ym)))
Partial rename
If you only wanted to replace the names in the character vector old with the corresponding names in the character vector new then use the following:
assign(paste0("output", ym),
setNames(tbl, replace(names(tbl), match(old, names(tbl)), new)))
Variation
You might consider putting your data frames in a list instead of having a bunch of loose objects in your workspace:
L <- list()
L[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
.GlobalEnv could also be used in place of L (omitting the L <- list() line) if you want this style but still to put the objects separately in the global environment.
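Put together inside a loop, the list variation might look like the sketch below. The `dates` vector, the stand-in `tbl`, and the numeric `%Y%m` suffix are assumptions made for illustration; the question used the year followed by the month name instead.

```r
# Hypothetical dates to loop over.
dates <- as.Date(c("2017-01-31", "2017-02-28"))

outputs <- list()
# Loop over indices: `for (d in dates)` would strip the Date class.
for (i in seq_along(dates)) {
  d <- dates[i]

  # Stand-in for the table produced in each iteration.
  tbl <- data.frame(colA = 1:5, colB = 5:9)

  # Locale-independent suffix such as "201702".
  suffix <- format(d, "%Y%m")

  # Store the renamed table under a name built from the date.
  outputs[[paste0("output", suffix)]] <-
    setNames(tbl, paste0(names(tbl), suffix))
}

names(outputs)
```

Individual results are then available as, for example, `outputs$output201702`, and the whole collection can be passed to `lapply` or similar functions.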
dplyr
Here it is using dplyr and rlang but it does involve increased complexity:
library(dplyr)
library(rlang)
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
rename(!!!setNames(names(tbl), paste0(names(tbl), ym)))
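With dplyr 1.0.0 or later, `rename_with()` applies a function to the column names and may read more simply. A minimal sketch, assuming that version of dplyr is installed; `renamed` is just an illustrative name:

```r
library(dplyr)

tbl <- tibble(colA = 1:5, colB = 5:9)
ym <- "2017February"

# Append the suffix to every column name.
renamed <- tbl %>% rename_with(~ paste0(.x, ym))

names(renamed)
```

The result can be stored under a computed name in the same way as above, for example in a list element named `paste0("output", ym)`.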
I have a data like this (named spectra):
#Milk spectra: 1234
##XYDATA=(X++(Y..Y))
649.025085449219
667.675231457819
686.325377466418
##XYDATA=(X++(Y..Y))
723.625669483618
742.275815492218
760.925961500818
##XYDATA=(X++(Y..Y))
872.826837552417
891.476983561017
910.127129569617
928.777275578216
In this data, each occurrence of the string ##XYDATA=(X++(Y..Y)) marks the start of the data for a different animal.
So, I want code that can split this sample into 3 pieces of data.
Animal 1: 3 lines after 1st ' ##XYDATA=(X++(Y..Y))'
Animal 2: 3 lines after 2nd ' ##XYDATA=(X++(Y..Y))'
And so on.
I tried this line of code, but it only extracts the first line after each appearance of the string '##XYDATA=(X++(Y..Y))', all together. Thus, it did not meet my expectation of getting three lines, as a separate piece of data, after each appearance of the string.
bo<-data.frame(spectra$V1[which(spectra$V1 == '##XYDATA=(X++(Y..Y))')+1])
Okay, I think you could do something along these lines. I'm sure this could be made better and more efficient, but the idea is to read the data in as a character vector
and then loop through it to spread it out. Note that this assumes there is always the same number of measurements per animal and that you have a way to identify the marker values.
c_data<- c("split", 1, 2, 3,
"split", 4, 5, 6)
y<- c_data == "split"
df_wide <- data.frame("animal"= character(), "v1" = numeric(), "v2" = numeric(), "v3" = numeric(),
stringsAsFactors = FALSE)
names(df_wide)<- c("animal", "v1", "v2", "v3")
x <- 0
for (i in seq_along(c_data)){
if (y[i]){
x <- x +1
df_wide[x,] <- rbind(c(c_data[i], c_data[i+1], c_data[i+2], c_data[i+3]))
}
}
yields
animal v1 v2 v3
1 split 1 2 3
2 split 4 5 6
If it is a one-time thing, it may not be worth trying to write something nicer. If it is an ongoing thing, then you may want to look at using an apply-family function, for which you would write a small helper function.
You can do either of the following with split and map:
library(dplyr)
library(purrr)
df %>%
mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
split(.$Animal) %>%
map(~slice(., -1) %>% mutate(V1 = as.numeric(V1))) %>%
'['(-1)
This creates an indicator variable Animal, splits by that indicator, removes the first row of each data frame, converts V1 to numeric, and finally removes the first element of the list.
You can also do the following:
df %>%
mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
filter(!grepl("^#.*$", V1)) %>%
mutate(V1 = as.numeric(V1)) %>%
split(.$Animal)
This also creates the indicator Animal, but instead it filters out all rows that start with a # sign and converts V1 to numeric before splitting into separate data frames.
Result:
$`1`
# A tibble: 3 x 2
V1 Animal
<dbl> <int>
1 649.0251 1
2 667.6752 1
3 686.3254 1
$`2`
# A tibble: 3 x 2
V1 Animal
<dbl> <int>
1 723.6257 2
2 742.2758 2
3 760.9260 2
$`3`
# A tibble: 4 x 2
V1 Animal
<dbl> <int>
1 872.8268 3
2 891.4770 3
3 910.1271 3
4 928.7773 3
Note:
Here I assumed #Milk spectra: 1234 is also a row in your column, hence the subsetting at the end.
Data:
df = read.table(textConnection("'#Milk spectra: 1234'
##XYDATA=(X++(Y..Y))
649.025085449219
667.675231457819
686.325377466418
##XYDATA=(X++(Y..Y))
723.625669483618
742.275815492218
760.925961500818
##XYDATA=(X++(Y..Y))
872.826837552417
891.476983561017
910.127129569617
928.777275578216"),comment.char = "", stringsAsFactors = FALSE)
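The same indicator-and-split idea can also be sketched in base R on a plain character vector. The vector `spectra_lines` below is an assumption about how the file might look after `readLines()`; the variable names are illustrative only.

```r
# Assumed input: the file contents as one string per line.
spectra_lines <- c("#Milk spectra: 1234",
                   "##XYDATA=(X++(Y..Y))",
                   "649.025085449219", "667.675231457819", "686.325377466418",
                   "##XYDATA=(X++(Y..Y))",
                   "723.625669483618", "742.275815492218", "760.925961500818",
                   "##XYDATA=(X++(Y..Y))",
                   "872.826837552417", "891.476983561017",
                   "910.127129569617", "928.777275578216")

marker <- "##XYDATA=(X++(Y..Y))"

# Running count of markers seen so far: 0 before the first marker,
# 1 for the first animal, 2 for the second, and so on.
animal <- cumsum(spectra_lines == marker)

# Keep only data lines that belong to an animal (drop every "#" line).
keep <- animal > 0 & !startsWith(spectra_lines, "#")

# One numeric vector per animal.
by_animal <- split(as.numeric(spectra_lines[keep]), animal[keep])
by_animal
```

Comparing with `==` avoids regular expressions entirely, so the parentheses and plus signs in the marker need no escaping.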
I have a list of files. I also have a list of "names" which I extract with substr() from the actual filenames of these files. I would like to add a new column to each of the files in the list. This column will contain the corresponding element in "names", repeated once for each row in the file.
For example:
df1 <- data.frame(x = 1:3, y=letters[1:3])
df2 <- data.frame(x = 4:6, y=letters[4:6])
filelist <- list(df1,df2)
ID <- c("1A","IB")
Pseudocode
for( i in length(filelist)){
filelist[i]$SampleID <- rep(ID[i],nrow(filelist[i])
}
// basically create a new column in each of the dataframes in filelist, and fill the column with repeted corresponding values of ID
my output should be like:
filelist[1] should be:
x y SampleID
1 1 a 1A
2 2 b 1A
3 3 c 1A
filelist[2] should be:
x y SampleID
1 4 d IB
2 5 e IB
3 6 f IB
and so on.....
Any idea how this could be done?
An alternative solution is to use cbind, taking advantage of the fact that R will recycle the values of a shorter vector.
For Example
x <- df2 # from above
cbind(x, NewColumn="Singleton")
# x y NewColumn
# 1 4 d Singleton
# 2 5 e Singleton
# 3 6 f Singleton
There is no need for the use of rep. R does that for you.
Therefore, you could put filelist[[i]] <- cbind(filelist[[i]], SampleID = ID[i]) in your for loop or, as @Sven pointed out, you can use the cleaner mapply:
filelist <- mapply(cbind, filelist, "SampleID"=ID, SIMPLIFY=FALSE)
This is a corrected version of your loop:
for( i in seq_along(filelist)){
filelist[[i]]$SampleID <- rep(ID[i],nrow(filelist[[i]]))
}
There were 3 problems:
A final ) was missing after the command in the body.
Elements of lists are accessed by [[, not by [. [ returns a list of length one. [[ returns the element only.
length(filelist) is just one value, so the loop runs for the last element of the list only. I replaced it with seq_along(filelist).
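The last two points can be checked interactively. A small sketch with a made-up two-element list:

```r
# Made-up list of two data frames.
filelist <- list(data.frame(x = 1:3), data.frame(x = 4:6))

# `[` keeps the list wrapper; `[[` extracts the data frame itself.
single_bracket <- filelist[1]
double_bracket <- filelist[[1]]
class(single_bracket)   # "list"
class(double_bracket)   # "data.frame"

# length() is a single number, so `for (i in length(filelist))`
# runs once; seq_along() gives every index.
n_elements <- length(filelist)
all_indices <- seq_along(filelist)
n_elements
all_indices
```

`seq_along()` is also safe for an empty list, where `1:length(filelist)` would iterate over `1` and `0`.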
A more efficient approach is to use mapply for the task:
mapply(function(x, y) "[<-"(x, "SampleID", value = y) ,
filelist, ID, SIMPLIFY = FALSE)
This one worked for me:
Create a new column for every dataframe in a list; fill the values of the new column based on existing column. (In your case IDs).
Example:
# Create dummy data
df1<-data.frame(a = c(1,2,3))
df2<-data.frame(a = c(5,6,7))
# Create a list
l<-list(df1, df2)
> l
[[1]]
a
1 1
2 2
3 3
[[2]]
a
1 5
2 6
3 7
# add new column 'b'
# create 'b' values based on column 'a'
l2<-lapply(l, function(x)
cbind(x, b = x$a*4))
Results in:
> l2
[[1]]
a b
1 1 4
2 2 8
3 3 12
[[2]]
a b
1 5 20
2 6 24
3 7 28
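In your case, where the new values come from the separate vector ID rather than from an existing column, something like:
filelist<-lapply(seq_along(filelist), function(i)
cbind(filelist[[i]], SampleID = ID[i]))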
The purrr way, using map2
library(dplyr)
library(purrr)
map2(filelist, ID, ~cbind(.x, SampleID = .y))
#[[1]]
# x y SampleID
#1 1 a 1A
#2 2 b 1A
#3 3 c 1A
#[[2]]
# x y SampleID
#1 4 d IB
#2 5 e IB
#3 6 f IB
Or can also use
map2(filelist, ID, ~.x %>% mutate(SampleID = .y))
If you name the list, we can use imap and add the new column based on its name.
names(filelist) <- c("1A","IB")
imap(filelist, ~cbind(.x, SampleID = .y))
#OR
#imap(filelist, ~.x %>% mutate(SampleID = .y))
which is similar to using Map
Map(cbind, filelist, SampleID = names(filelist))
A tricky way:
library(plyr)
names(filelist) <- ID
result <- ldply(filelist, data.frame)
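If a single combined data frame is the goal, a base R sketch of the same idea is to add the ID to each data frame and then row-bind them. The data below is the question's example; `combined` is an illustrative name.

```r
# Example data from the question.
df1 <- data.frame(x = 1:3, y = letters[1:3])
df2 <- data.frame(x = 4:6, y = letters[4:6])
filelist <- list(df1, df2)
ID <- c("1A", "IB")

# Add the matching ID as the first column of each data frame,
# then stack the pieces into one data frame.
with_id <- Map(function(d, id) cbind(SampleID = id, d), filelist, ID)
combined <- do.call(rbind, with_id)
rownames(combined) <- NULL

combined
```

Keeping the intermediate list `with_id` makes it easy to switch back to the one-data-frame-per-file layout if that turns out to be more convenient.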
data_lst <- list(
data_1 = data.frame(c1 = 1:3, c2 = 3:1),
data_2 = data.frame(c1 = 1:3, c2 = 3:1)
)
f <- function (data, name){
data$name <- name
data
}
Map(f, data_lst , names(data_lst))