I have a CellDataSet object (cds):
> class(cds)
[1] "CellDataSet"
attr(,"package")
[1] "monocle"
It is composed of 6 different aggregated samples that can be distinguished by the suffixes of their barcodes. Here is a sample of what these look like:
cds$barcode
1 ACCAACGACTTGCC-1
2 CGCACTACTCGATG-4
3 CGTACAGAGTATCG-5
4 CGTCAAGATCACCC-5
5 ACTGAGACCCGTAA-2
6 TTAGACCTCGGGAA-6
7 TTCAAGCTGGTATC-3
8 TTTGACTGTCCTTA-4
9 TTTGCATGCTCTTA-4
10 AAACATTGAAGCCT-5
Is it possible to split this CellDataSet object into 6 smaller CellDataSet objects that each comprise barcodes with the same "-n" suffix, so I can analyse each sample separately? For example, the barcodes of CellDataSet1 would look like:
cds$barcode
1 AAACCGTGCCCTCA-1
2 AAACGCACACGCAT-1
3 AAACGGCTTCCGAA-1
4 AAAGACGAACCCAA-1
5 AAAGACGACTGTTT-1
6 AAAGAGACAAAGCA-1
7 AAAGATCTGGTAAA-1
8 AAAGCAGAGCAAGG-1
9 AAAGCAGATTATCC-1
10 AAAGCCTGATGACC-1
etc, and would contain the corresponding attributes as in the original object.
Many thanks!
Abigail
You can use the tidyverse to solve the problem:
library(tidyverse)
dataseti <- data.frame(barcode = c("ACCAACGACTTGCC-1",
"GCACTACTCGATG-4",
"CGTACAGAGTATCG-5",
"CGTCAAGATCACCC-5",
"ACTGAGACCCGTAA-2",
"TTAGACCTCGGGAA-6",
"TTCAAGCTGGTATC-3",
"TTTGACTGTCCTTA-4",
"TTTGCATGCTCTTA-4",
"AAACATTGAAGCCT-5"),
stringsAsFactors = FALSE)
Let's say you want group 4:
dataseti %>% separate(barcode, c("chain", "group"), "-") %>% filter(group == 4)
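To split the CellDataSet itself (rather than just a data frame of barcodes), a minimal sketch along these lines may work, assuming the barcodes are stored in pData(cds)$barcode and that a CellDataSet can be subset by column like any other ExpressionSet-derived object:
suffix <- sub(".*-", "", pData(cds)$barcode)                              # "1", "2", ..., "6"
cds_list <- lapply(sort(unique(suffix)), function(s) cds[, suffix == s])  # one CellDataSet per suffix
names(cds_list) <- paste0("cds", sort(unique(suffix)))
# cds_list$cds1 now contains only the "-1" cells, with phenoData and featureData carried along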
Good luck!
This question is similar to my previous question (How to create a "householdID" for rows with shared "customerID" and "spouseID"?), although this version deals with a rat's-nest mix of character and numeric strings instead of simply numeric IDs. I'm trying to create a "household ID" for all couples who appear in a larger dataframe. In short, each individual has a "customerID" and "spouseID". If a customer is married, their spouse's ID appears in the "spouseID" column. If they are not married, the spouseID field is empty. Each member of a married couple appears on its own row, resulting in the need for a common "householdID" that a couple shares.
What is the best way to add a unique householdID that is duplicated for couples? A small and over-simplified example of the original data is as follows. Note that the original IDs are far more complex, with varying lengths and patterns of numbers and characters.
df <- data.frame(
prospectID=as.character(c("G1339jf", "6dhd54G1", "Cf14c", "Bvmkm1", "kda-1qati", "pwn9enr", "wj44v04t4t", "D15", "dkfs044nng", "v949s")),
spouseID=as.character( c( "", "wj44v04t4t", "", "pwn9enr", "", "Bvmkm1", "6dhd54G1", "", "v949s", "dkfs044nng")),
stringsAsFactors = FALSE)
> df
prospectID spouseID
1 G1339jf
2 6dhd54G1 wj44v04t4t
3 Cf14c
4 Bvmkm1 pwn9enr
5 kda-1qati
6 pwn9enr Bvmkm1
7 wj44v04t4t 6dhd54G1
8 D15
9 dkfs044nng v949s
10 v949s dkfs044nng
An example of my desired result is as follows:
> df
prospectID spouseID HouseholdID
1 G1339jf 1
2 6dhd54G1 wj44v04t4t 2
3 Cf14c 3
4 Bvmkm1 pwn9enr 4
5 kda-1qati 5
6 pwn9enr Bvmkm1 4
7 wj44v04t4t 6dhd54G1 2
8 D15 6
9 dkfs044nng v949s 7
10 v949s dkfs044nng 7
This is an edited solution due to comments made by OP.
Illustrative data:
df <- data.frame(
prospectID=as.character(c("A1jljljljl344asbvc", "A2&%$ll##fffh", "B1665453sskn:;", "B2gavQWEΩΩø⁄", "C1", "D1", "E1#+'&%", "E255646321", "F1", "G1")),
spouseID=as.character(c("A2&%$ll##fffh", "A1jljljljl344asöbvc", "B2gavQWEΩΩø⁄", "B1665453sskn:;_", "", "", "E255646321", "E1#+'&%", "", "")),
stringsAsFactors = FALSE)
First define a pattern to match:
patt <- paste(df$prospectID, df$spouseID, sep = "|")
Second, define a for loop; here, a little editing is necessary for the first and the last value. Maybe others can improve on this part:
for (i in 1:nrow(df)) {
  df$HouseholdID[1] <- 1                                                   # first value set by hand
  df$HouseholdID[i] <- ifelse(grepl(patt[i], df$prospectID[i + 1]), 1, 0)  # does row i's pattern match the next prospectID?
  df$HouseholdID[10] <- 1                                                  # last value set by hand
}
The final step is to run cumsum:
df$HouseholdID <- cumsum(df$HouseholdID)
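To see why this works: cumsum turns a vector of "a new group starts here" flags into running group numbers, for example:
cumsum(c(1, 0, 1, 0, 1, 1, 1, 0, 1, 1))
#> [1] 1 1 2 2 3 4 5 5 6 7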
The result:
df
prospectID spouseID HouseholdID
1 A1jljljljl344asbvc A2&%$ll##fffh 1
2 A2&%$ll##fffh A1jljljljl344asöbvc 1
3 B1665453sskn:; B2gavQWEΩΩø⁄ 2
4 B2gavQWEΩΩø⁄ B1665453sskn:;_ 2
5 C1 3
6 D1 4
7 E1#+'&% E255646321 5
8 E255646321 E1#+'&% 5
9 F1 6
10 G1 7
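If the data are as regular as in the OP's original df (each spouseID, when present, exactly matches the partner's prospectID), here is a different sketch that avoids regular expressions altogether, offered as a suggestion rather than as part of the answer above:
key <- ifelse(df$spouseID == "",
              df$prospectID,
              paste(pmin(df$prospectID, df$spouseID),
                    pmax(df$prospectID, df$spouseID), sep = "|"))   # one canonical key per couple
df$HouseholdID <- as.integer(factor(key, levels = unique(key)))     # numbered in order of first appearance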
I'm trying to use R on a large CSV file that for this example can be said to represent a list of people and forms of transportation. If a person owns that mode of transportation, this is represented by a X in the corresponding cell. Example data of this is as per below:
Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,
What I'm after is to learn which persons have identical modes of transportation or, ideally, whose modes of transportation differ by no more than one.
The format is a bit weird, but assuming the CSV file is named example.csv, I can read it into a data frame and transpose it as per below (it should be fairly obvious that I'm a complete R noob):
ex <- read.csv('example.csv')
ext <- as.data.frame(t(ex))
This post explained how to find duplicates, and it seems to work:
duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1]
which(duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1])
This returns the following indexes:
1 2 4 5 6 7
That does indeed correspond with what I consider to be duplicate rows. That is, Peter has the same modes of transportation as Mary and Stan (indexes 2, 4 and 6); Don and Mike likewise share the same modes of transportation, indexes 5 and 7.
Again, that seems to work OK, but if the number of modes of transportation and people is large, it becomes really difficult to know not just which rows are duplicates, but which indexes actually matched: in this case, that indexes 2, 4 and 6 are identical and that 5 and 7 are identical.
Is there an easy way of getting that information so that one doesn't have to try and find the matches manually?
Also, given all of the above, is it possible to alter the code so that it considers rows to match if they differ in only a limited number of X positions (for example, a difference of one is acceptable, so as long as two persons in the above example have no more than one differing mode of transportation, it is still considered a match)?
Happy to elaborate further and very grateful for any and all help.
library(dplyr)
library(tidyr)
ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,")
ext <- tidyr::pivot_longer(ex, -Type, names_to = "person")
# head(ext)
ext <- ext %>%
group_by(person) %>%
filter(value == "X") %>%
summarise(Modalities = n(), Which = paste(Type, collapse=", ")) %>%
arrange(desc(Modalities), Which) %>%
mutate(IdenticalGrp = rle(Which)$lengths %>% {rep(seq(length(.)), .)})
ext
#> # A tibble: 6 x 4
#> person Modalities Which IdenticalGrp
#> <chr> <int> <chr> <int>
#> 1 Paul 3 Scooter, Skateboard, Boat 1
#> 2 Don 2 Car, Skateboard 2
#> 3 Mike 2 Car, Skateboard 2
#> 4 Mary 2 Scooter, Skateboard 3
#> 5 Peter 2 Scooter, Skateboard 3
#> 6 Stan 2 Scooter, Skateboard 3
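As an aside, the IdenticalGrp column relies on the rows being sorted by Which: rle() measures the runs of identical Which strings and rep() turns the run lengths into group numbers, for example:
rle(c("a", "a", "b", "b", "b"))$lengths %>% {rep(seq(length(.)), .)}
#> [1] 1 1 2 2 2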
To get the members of any particular IdenticalGrp you can just pull like this:
ext %>% filter(IdenticalGrp == 3) %>% pull(person)
#> [1] "Mary" "Peter" "Stan"
I have a small issue regarding a dataset I am using. Suppose I have a dataset called mergedData2, defined as a subset of mergedData using these commands:
mergedData=rbind(test_set,training_set)
lookformean<-grep("mean()",names(mergedData),fixed=TRUE)
lookforstd<-grep("std()",names(mergedData),fixed=TRUE)
varsofinterests<-sort(c(lookformean,lookforstd))
mergedData2<-mergedData[,c(1:2,varsofinterests)]
If I do names(mergedData2), I get:
[1] "volunteer_identifier" "type_of_experiment"
[3] "body_acceleration_mean()-X" "body_acceleration_mean()-Y"
[5] "body_acceleration_mean()-Z" "body_acceleration_std()-X"
(I take these first 6 names as an MWE, but I have a vector of 68 names.)
Now, suppose I want to take the average of each of the measurements per volunteer_identifier and type_of_experiment. For this, I used a combination of split and lapply:
mylist<-split(mergedData2,list(mergedData2$volunteer_identifier,mergedData2$type_of_experiment))
average_activities<-lapply(mylist,function(x) colMeans(x))
average_dataset<-t(as.data.frame(average_activities))
As average_activities is a list, I converted it into a data frame and transposed this data frame to keep the same format as mergedData and mergedData2. The problem now is the following: when I call names(average_dataset), it returns NULL! But, more strangely, when I do head(average_dataset), it returns:
volunteer_identifier type_of_experiment body_acceleration_mean()-X body_acceleration_mean()-Y
1 1 0.2773308 -0.01738382
2 1 0.2764266 -0.01859492
3 1 0.2755675 -0.01717678
4 1 0.2785820 -0.01483995
5 1 0.2778423 -0.01728503
6 1 0.2836589 -0.01689542
This is just a small sample of the output, but it shows that the names of the variables are there. So why does names(average_dataset) return NULL?
Thanks in advance for your reply, best
EDIT: Here is an MWE for mergedData2:
volunteer_identifier type_of_experiment body_acceleration_mean()-X body_acceleration_mean()-Y
1 2 5 0.2571778 -0.02328523
2 2 5 0.2860267 -0.01316336
3 2 5 0.2754848 -0.02605042
4 2 5 0.2702982 -0.03261387
5 2 5 0.2748330 -0.02784779
6 2 5 0.2792199 -0.01862040
body_acceleration_mean()-Z body_acceleration_std()-X body_acceleration_std()-Y body_acceleration_std()-Z
1 -0.01465376 -0.9384040 -0.9200908 -0.6676833
2 -0.11908252 -0.9754147 -0.9674579 -0.9449582
3 -0.11815167 -0.9938190 -0.9699255 -0.9627480
4 -0.11752018 -0.9947428 -0.9732676 -0.9670907
5 -0.12952716 -0.9938525 -0.9674455 -0.9782950
6 -0.11390197 -0.9944552 -0.9704169 -0.9653163
gravity_acceleration_mean()-X gravity_acceleration_mean()-Y gravity_acceleration_mean()-Z
1 0.9364893 -0.2827192 0.1152882
2 0.9274036 -0.2892151 0.1525683
3 0.9299150 -0.2875128 0.1460856
4 0.9288814 -0.2933958 0.1429259
5 0.9265997 -0.3029609 0.1383067
6 0.9256632 -0.3089397 0.1305608
gravity_acceleration_std()-X gravity_acceleration_std()-Y gravity_acceleration_std()-Z
1 -0.9254273 -0.9370141 -0.5642884
2 -0.9890571 -0.9838872 -0.9647811
3 -0.9959365 -0.9882505 -0.9815796
4 -0.9931392 -0.9704192 -0.9915917
5 -0.9955746 -0.9709604 -0.9680853
6 -0.9988423 -0.9907387 -0.9712319
My task is to get this average_dataset, i.e. a dataset containing the average value of each physical quantity (column 3 onwards) for each volunteer and type of experiment (e.g. 1 1 mean1 mean2 mean3 ... mean68, 2 1 mean1 mean2 mean3 ... mean68, etc.).
After this I will have to export it as a txt file (so I think using write.table with row.names = FALSE and col.names = TRUE). Note that for now, if I do this and then re-import the generated file using read.table, I don't recover the column names of the dataset, even when specifying col.names = TRUE.
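For what it's worth, here is a minimal sketch of what appears to be going on (based only on the code shown above): t() returns a matrix, and a matrix has no names(), only dimnames(), so the column labels have to be read with colnames(); when re-importing, check.names = FALSE keeps non-syntactic names such as body_acceleration_mean()-X.
average_dataset <- t(as.data.frame(average_activities))
names(average_dataset)       # NULL: average_dataset is a matrix, not a data frame
colnames(average_dataset)    # the variable names are stored here instead
write.table(average_dataset, "averages.txt", row.names = FALSE, col.names = TRUE)
reread <- read.table("averages.txt", header = TRUE, check.names = FALSE)
names(reread)                # names like "body_acceleration_mean()-X" are preserved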
I have a list of genes as rownames of my eset and I want to convert them to Ensembl gene IDs.
I used getGene from the biomaRt package, but it returned the same symbol twice for some genes!
Here is a small example of my code:
library (biomaRt)
rownames(eset)
[1] "EPC1" "MYO3A" "PARD3" "ATRNL1" "GDF2" "IL10RA" "GAD2" "CCDC6"
getGene(rownames(eset),type='hgnc_symbol',mart)[c(1,9)]
# [1] is the hgnc_symbol to recheck the matched data
# [9] is the ensembl_gene_id
hgnc_symbol ensembl_gene_id
1 ATRNL1 ENSG00000107518
2 CCDC6 ENSG00000108091
3 EPC1 ENSG00000120616
4 GAD2 ENSG00000136750
5 GDF2 ENSG00000263761
6 IL10RA ENSG00000110324
7 IL10RA LRG_151
8 MYO3A ENSG00000095777
9 PARD3 ENSG00000148498
As you can see, there are two entries for "IL10RA" in the hgnc_symbol column, but I only had one "IL10RA" in rownames(eset); this causes a problem at the end, when I want to add the Ensembl IDs to fData(eset)!
How can I solve this problem?
I would like the result to look like this:
hgnc_symbol ensembl_gene_id
1 ATRNL1 ENSG00000107518
2 CCDC6 ENSG00000108091
3 EPC1 ENSG00000120616
4 GAD2 ENSG00000136750
5 GDF2 ENSG00000263761
6 IL10RA ENSG00000110324
7 MYO3A ENSG00000095777
8 PARD3 ENSG00000148498
Thanks in advance,
I've found a solution using !duplicated on the getGene result.
Something like this:
g_All <- getGene(id = rownames(eset), type = 'hgnc_symbol', mart = mart)
g_All <- g_All[!duplicated(g_All[, 1]), ]   # keep only the first row per hgnc_symbol
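An alternative sketch (my own suggestion, not part of the answer above): since the extra rows come from LRG_* records such as LRG_151, you could drop those explicitly instead of keeping only the first match per symbol:
g_All <- g_All[!grepl("^LRG_", g_All$ensembl_gene_id), ]   # removes LRG_* duplicates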
I have a cross-correlation function, crosscor, and I would like to loop over the columns of my data matrix, calling the function on each one. Each time it is run, the function outputs a cross-correlation table that looks something like this:
Lags Cross.Correlation P.value
1 0 -0.0006844958 0.993233547
2 1 0.1021006478 0.204691627
3 2 0.0976746274 0.226628526
4 3 0.1150337867 0.155426784
5 4 0.1943150900 0.016092041
6 5 0.2360415470 0.003416147
7 6 0.1855274375 0.022566685
8 7 0.0800646242 0.330081900
9 8 0.1111071269 0.177338885
10 9 0.0689602574 0.404948252
11 10 -0.0097332533 0.906856279
12 11 0.0146241719 0.860926388
13 12 0.0862549791 0.302268025
14 13 0.1283308019 0.125302070
15 14 0.0909537922 0.279988895
16 15 0.0628012627 0.457795228
17 16 0.1669241304 0.047886605
18 17 0.2019811994 0.016703619
19 18 0.1440124960 0.090764520
20 19 0.1104842808 0.197035340
21 20 0.1247428178 0.146396407
I would like to put all of the outputs together in a data frame, and ultimately export it into a csv file, so the columns are as follows: lags.3, cross-correlation.3, p-value.3, lags.4, cross-correlation.4, ... etc., up to p-value.50.
I have tried to use do.call as follows, but have not been successful:
for(i in 3:50)
{
l1<-crosscor(data[,2], data[,i], lagmax=20)
ccdata<-do.call(rbind, l1)
cat("Data row", i)
}
I've also tried just creating the data frame straight out, but am just getting the lag column names:
ccdata <- data.frame()
for(i in 3:50)
{
ccdata[i-2:i+1]<-crosscor(data[,2], data[,i], lagmax=20)
cat("Data row", i)
}
What am I doing wrong? Or is there an online resource on data sets like this that I could use to figure out how to do it? Best,
There is a transpose method for data.frames. If "crosscor" is the name of the result object, just try this:
tcrosscor <- t(crosscor)
write.csv(tcrosscor, file="my_crosscor_1.csv")
The first row would be the Lags; the second row, the Cross.Correlations; the third row, the P.values. I suppose you could "flatten" it further so it would be entirely "horizontal" or "wide". Seems painful, but it might go something like this:
single_line <- as.data.frame(t(unlist(tcrosscor)))
names(single_line) <- paste(rep(c("Lag", "Cross.Correlation", "P.value"), ncol(tcrosscor)),
                            rep(seq_len(ncol(tcrosscor)), each = 3), sep = ".")
write.csv(single_line, file="my_single_1.csv")
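For the original goal of combining all of the series into one wide table, a sketch along these lines might also help (it assumes crosscor() returns the three-column data frame shown in the question, with the same number of lags for every column of data):
results <- lapply(3:50, function(i) {
  cc <- crosscor(data[, 2], data[, i], lagmax = 20)
  names(cc) <- paste(names(cc), i, sep = ".")   # e.g. Lags.3, Cross.Correlation.3, P.value.3
  cc
})
ccdata <- do.call(cbind, results)               # 21 rows, 3 columns per series
write.csv(ccdata, file = "all_crosscor.csv", row.names = FALSE)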