I have searched the web and Stack Overflow and was not able to find a solution to my problem. I don't know whether dplyr or a loop would be more efficient.
Below is an example of a data frame (my own datasets have more than 10,000 rows) that I would like to split in three based on column B (<250), either as a list with three objects or as three individual data frames. Then, for each new data frame, I would like to, for example, count the number of points (the number of rows) and compute the duration (column Time is in seconds). Any suggestion would be really appreciated.
thank you
Martin
dput(mydata)
structure(list(Time = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 0L,
11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L), A = c(4L, 5L, 6L, 7L,
3L, 7L, 8L, 10L, 11L, 8L, 10L, 12L, 14L, 6L, 14L, 16L, 20L, 22L
), B = c(100.25, 150.75, 200, 1000.56, 2000.1, 100, 150, 50,
25.2, 102.25, 152.75, 202, 1002.56, 2002.1, 102, 152, 52, 27.2
)), .Names = c("Time", "A", "B"), class = "data.frame", row.names = c(NA,
-18L))
Grab IRanges from Bioconductor (df below is the data frame called mydata in the question):
library(IRanges)
runs <- slice(Rle(df$B), upper=250)
This is an RleViews object, with a view (range) for every run under 250. You can extract the width of the views (the number of points that would be in each data frame):
width(runs)
You can split the data frame into a list like this:
blocks <- extractList(df, ranges(runs))
Note that blocks is now a formal SplitDataFrameList.
To compute the duration, you can extract the Time column as an IntegerList and compute the difference between the last and first element of every list element:
time <- blocks[,"Time"]
ptail(time, 1) - phead(time, 1)
This happens without actually forming separate list elements (the list is lazily managed) and so should be fast.
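If installing Bioconductor is not an option, the same run-based split can be sketched in base R. This is a swapped-in technique (a cumulative sum over changes in B < 250), not the IRanges method above. The names below, run_id, blocks, n_points and duration, are made up for illustration, and the strict < 250 follows the question.

```r
# Data from the question (copied as posted, including Time = 0L in row 10)
mydata <- data.frame(
  Time = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 0L,
           11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L),
  A = c(4L, 5L, 6L, 7L, 3L, 7L, 8L, 10L, 11L, 8L,
        10L, 12L, 14L, 6L, 14L, 16L, 20L, 22L),
  B = c(100.25, 150.75, 200, 1000.56, 2000.1, 100, 150, 50, 25.2,
        102.25, 152.75, 202, 1002.56, 2002.1, 102, 152, 52, 27.2)
)

below  <- mydata$B < 250
# A new run starts whenever `below` flips between TRUE and FALSE
run_id <- cumsum(c(TRUE, diff(below) != 0))
# Keep only the rows under the threshold, one list element per run
blocks <- split(mydata[below, ], run_id[below])

n_points <- sapply(blocks, nrow)
# Duration as last Time minus first Time within each block
duration <- sapply(blocks, function(d) d$Time[nrow(d)] - d$Time[1])

n_points  # 3 7 4
duration  # 2 6 3
```

blocks is an ordinary list of data frames, so lapply() or sapply() can compute any other per-block summary.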
It's not clear how your specifications line up with your expected output. Here are two different methods of splitting:
# Gives three groups
split( mydata[mydata$B <250, ] , (1:nrow(mydata[mydata$B <250, ]))%% 3)
# Gives consecutive groups of up to three rows (the first group has only two)
split( mydata[mydata$B <250, ] , (1:nrow(mydata[mydata$B <250, ]))%/% 3)
This shows how to count the numbers of rows from the first method:
> three <- split( mydata[mydata$B <250, ] , (1:nrow(mydata[mydata$B <250, ]))%% 3)
> lapply(three, nrow)
$`0`
[1] 4
$`1`
[1] 5
$`2`
[1] 5
In the map dataframe, the first column is chromosome and the second column is snp.
> dput(map[1:10,])
structure(list(V1 = c(15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L,
15L, 15L), V2 = c("SNP_A-1850451", "SNP_A-1910129", "SNP_A-1793338",
"SNP_A-1938260", "SNP_A-2269619", "SNP_A-2246474", "SNP_A-2275061",
"SNP_A-1961089", "SNP_A-2131173", "SNP_A-4227303"), V3 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), V4 = c(41361016L, 41366515L,
41386408L, 41396801L, 41410208L, 41419430L, 41440229L, 41446338L,
41449654L, 41455192L)), row.names = c(NA, 10L), class = "data.frame")
The mc dataframe is a subset of map based on the snps vector.
> snps <- c("SNP_A-2240938","SNP_A-2249234","SNP_A-2120212","SNP_A-1864785","SNP_A-1944204","SNP_A-2101022","SNP_A-2192181","SNP_A-4296300","SNP_A-1961028","SNP_A-2215249","SNP_A-2172180","SNP_A-1945117","SNP_A-1941491")
>
> # Subset the map file by the SNPs cis
> mc <- map[map[,2] %in% snps,]
Now I want to extract all of the remaining SNPs only on chromosome 14. I also want to use setdiff() to get the remaining SNPs that are different from those in mc. Then use these SNP names with the match() function on the original map file.
> # Extract all of the remaining SNPs only on chromosome 14
> for (i in 1:nrow(map)) {
+ if (map$V1==14 && setdiff(map,mc)) {
+ t.snp <- map[i,]
+ }
+ }
>
> t.snp
data frame with 0 columns and 10784 rows
Problem:
My t.snp dataframe has 0 columns.
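A vectorized sketch of the steps described above, with no loop. The toy map here is hypothetical (the rows shown by dput are all on chromosome 15), and only the column names V1 and V2 are taken from the question. Note that if() expects a single TRUE/FALSE, so a whole-column comparison such as map$V1==14 cannot be used as its condition in this way.

```r
# Hypothetical toy data: V1 = chromosome, V2 = SNP name
map <- data.frame(
  V1 = c(14L, 14L, 15L, 14L),
  V2 = c("SNP_1", "SNP_2", "SNP_3", "SNP_4"),
  stringsAsFactors = FALSE
)
snps <- c("SNP_2")

# Subset of map already selected
mc <- map[map$V2 %in% snps, ]

# SNP names on chromosome 14 that are not in mc
remaining <- setdiff(map$V2[map$V1 == 14], mc$V2)

# Look the names up in the original map
t.snp <- map[match(remaining, map$V2), ]
t.snp
```

setdiff() is applied to the vectors of SNP names rather than to the data frames themselves, which keeps the result a plain character vector that match() can use.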
I have a dataframe with 2 columns in it. The first column contains POSIXct data, and the second contains some sample value integers. I cannot for the life of me figure out how to convert this dataframe to xts.
My data:
structure(list(SampleDateTime = list(1422835200000, 1423353600000,
1423958400000, 1433030400000, 1433635200000, 1434326400000,
1434844800000, 1444521600000, 1445731200000, 1453593600000,
1.455408e+12, 1420934400000, 1424563200000, 1425772800000,
1426982400000, 1430006400000, 1.431216e+12, 1.440288e+12,
1448150400000, 1460851200000), SampleValue = list(9L, 3L,
2L, 1733L, 19L, 6L, 1L, 17L, 7L, 23L, 147L, 1L, 17L, 1L,
1L, 19L, 1L, 11L, 2L, 91L), Dttm = structure(c(1107216000,
1107734400, 1108339200, 1117411200, 1118016000, 1118707200, 1119225600,
1128902400, 1130112000, 1137974400, 1139788800, 1105315200, 1108944000,
1110153600, 1111363200, 1114387200, 1115596800, 1124668800, 1132531200,
1145232000), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c("feature$attributes",
"feature$attributes1", "feature$attributes2", "feature$attributes3",
"feature$attributes4", "feature$attributes5", "feature$attributes6",
"feature$attributes7", "feature$attributes8", "feature$attributes9",
"feature$attributes10", "feature$attributes11", "feature$attributes12",
"feature$attributes13", "feature$attributes14", "feature$attributes15",
"feature$attributes16", "feature$attributes17", "feature$attributes18",
"feature$attributes19"), class = "data.frame")
>
I use this to try and make the conversion:
q <- as.xts(t, order.by = t$Dttm, dateFormat="POSIXct", frequency=NULL, RECLASS=FALSE)
I get the following errors:
Error in coredata.xts(x) : currently unsupported data type
Any help appreciated. I can't figure out why it says the data type is unsupported when the documentation says data frames are supported.
As pointed out in the comments, your data is in an unusual format: the first two columns of the data frame are themselves lists rather than numeric vectors. So first unlist each column, giving dat.u (columns that are not lists are left unchanged), and then convert. We assume dat is the data shown via dput in the question.
library(xts) # also attaches zoo, which provides read.zoo
dat.u <- replace(dat, TRUE, lapply(dat, unlist))
z <- read.zoo(dat.u, index = "Dttm")
x <- as.xts(z)
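To see what the unlist step does, here is a small base R sketch on a hypothetical two-row data frame with one list column, mimicking the structure in the question. It only checks the column types and does not call xts.

```r
# Hypothetical two-row stand-in for the question's data
dat <- data.frame(
  Dttm = as.POSIXct(c(1107216000, 1107734400),
                    origin = "1970-01-01", tz = "UTC")
)
dat$SampleValue <- list(9L, 3L)  # a list column, as in the question

sapply(dat, is.list)    # SampleValue is a list, Dttm is not

# Unlist every column; non-list columns come back unchanged
dat.u <- replace(dat, TRUE, lapply(dat, unlist))

sapply(dat.u, is.list)  # no list columns left
str(dat.u)
```

Running sapply(dat, class) on the real data is a quick way to spot list columns before attempting the conversion.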
I have a huge data set that I have not been able to split into two sets
df<- structure(list(name = structure(1:3, .Label = c("a", "b", "c"
), class = "factor"), X3C_AALI_01A = c(651L, 2L, 1877L), X3C_AALJ_01B = c(419L,
2L, 1825L), X3C_AALK_01A = c(1310L, 52L, 1286L), X4H_AAAK_11B = c(2978L,
4L, 1389L), X5L_AAT0_01B = c(2576L, 15L, 1441L), X5L_AAT1_01A = c(2886L,
5L, 921L), X5T_A9QA_03A = c(929L, 3L, 935L), A1_A0SI_10A = c(1578L,
1L, 2217L), A1_A0SK_07C = c(3003L, 6L, 2984L), A1_A0SO_01A = c(6413L,
0L, 3577L), A1_A0SP_05B = c(5157L, 5L, 4596L), A2_A04P_01A = c(4283L,
6L, 2508L), X5L_AAh1_10A = c(2886L, 5L, 921L), X5T_A0QA_03A = c(929L,
3L, 935L), A1_A0Sm_10A = c(1578L, 1L, 2217L), A1_ArSK_01A = c(3003L,
6L, 2984L), A1_AfSO_01A = c(6413L, 0L, 3577L), A1_AuSP_05A = c(5157L,
5L, 4596L), A2_Ap4P_11A = c(4283L, 6L, 2508L)), class = "data.frame", row.names = c(NA,
-3L))
Basically, I want to split the data based on the two-digit code just before the last character of each column name. For example, the second column is named 3C_AALI_01A, and I want to generate two data sets based on the _01A part.
Columns whose code is 01 to 09 should go in one data frame, and columns whose code is 10 or higher should go in a second data frame. For the example data above,
the columns with the following names should be in one data frame
3C_AALI_01A
3C_AALJ_01B
3C_AALK_01A
5L_AAT0_01B
5L_AAT1_01A
5T_A9QA_03A
A1_A0SK_07C
A1_A0SO_01A
A1_A0SP_05B
A2_A04P_01A
5T_A0QA_03A
A1_ArSK_01A
A1_AfSO_01A
A1_AuSP_05A
and the columns with the following names should be in another data frame
4H_AAAK_11B
A1_A0SI_10A
5L_AAh1_10A
A1_A0Sm_10A
A2_Ap4P_11A
df1 <- df[,grep('0[1-9].$',colnames(df))]
df2 <- df[,-grep('0[1-9].$',colnames(df))]
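A variant of the two lines above using grepl() instead of grep(), sketched on a hypothetical four-column subset of the question's data. A logical index avoids two pitfalls: with negative indexing, the name column ends up in df2, and if no column matched, -grep() would return an empty index and select nothing. The names is_low and is_sample are made up for illustration.

```r
# Hypothetical subset of the question's data
df <- data.frame(
  name = c("a", "b", "c"),
  X3C_AALI_01A = c(651L, 2L, 1877L),
  X4H_AAAK_11B = c(2978L, 4L, 1389L),
  A1_A0SI_10A  = c(1578L, 1L, 2217L),
  X5T_A9QA_03A = c(929L, 3L, 935L)
)

is_low    <- grepl("_0[1-9].$", names(df))  # codes 01 to 09
is_sample <- names(df) != "name"            # leave out the id column

df1 <- df[, is_low, drop = FALSE]
df2 <- df[, is_sample & !is_low, drop = FALSE]

names(df1)  # "X3C_AALI_01A" "X5T_A9QA_03A"
names(df2)  # "X4H_AAAK_11B" "A1_A0SI_10A"
```

drop = FALSE keeps the result a data frame even when only one column matches.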
You could use a tidyr::separate(..., sep = -1) approach,
which uses negative string indexing, which is what you really want here.
Also, your data frame is transposed: it would be more usual to have a single column holding the sample names and numeric columns a, b, c, like t(df) without the unwanted coercion to string.
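Base R has no negative string indices, but counting back from nchar() gives the same effect. A minimal sketch, assuming every column name ends in two digits followed by one letter, as in the question. The vector cols and the names code and letter are made up for illustration.

```r
cols <- c("X3C_AALI_01A", "X4H_AAAK_11B", "A1_A0SI_10A", "X5T_A9QA_03A")

n      <- nchar(cols)
code   <- as.integer(substr(cols, n - 2, n - 1))  # the two digits
letter <- substr(cols, n, n)                      # the last character

# FALSE: codes 01 to 09, TRUE: codes 10 and above
groups <- split(cols, code >= 10)
groups
```

Comparing the code as a number rather than with a regular expression makes it straightforward to move the cut-off somewhere else later.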
So I have a massive dataframe and I'm trying to combine scores I calculated from multiple dataframes (about 17 dataframes) to this one dataframe and I need to do this process 12 different times. This is an example dataframe that I have
df=structure(list(id = structure(c(2L, 7L, 5L, 4L, 3L, 1L, 6L,
8L), .Label = c("B12", "B7", "C2", "C9", "D3", "E2", "E6", "R4"
), class = "factor"), age = c(42L, 45L, 83L, 59L, 49L, 46L, 52L,
23L)), class = "data.frame", row.names = c(NA, -8L))
So I need to calculate network metrics using the igraph package. Here are 2 matrices I have with different people in them
net_mat1=structure(c("B7", "E6", "D3", "C9"), .Dim = c(2L, 2L), .Dimnames = list(
NULL, c("target", "partner")))
net_mat2=structure(c("C2", "B12", "E2", "R4"), .Dim = c(2L, 2L), .Dimnames = list(
NULL, c("target", "partner")))
Here is what I'm calculating
library(igraph)
g1=graph_from_edgelist(net_mat1)
g2=graph_from_edgelist(net_mat2)
degree_cent_close_1=centr_degree(g1, mode = "all")
degree_cent_close_1 #create object that contains metrics
degree.cent_close2=centr_degree(g2, mode = "all")
degree.cent_close2 #create another object that contains metrics
I then create dataframes that contain the metrics I calculated
cent_score_df1=data.frame(degree_cent_close_1$res, V(g1)$name)
cent_score_df1
cent_score_df2=data.frame(degree.cent_close2$res, V(g2)$name)
cent_score_df2
I then try to match and index the values of these metrics back into the df dataframe doing this
df$centrality_scores <- cent_score_df1[ match(df[['id']], cent_score_df1[['V.g1..name']] ) , 'degree_cent_close_1.res']
df$centrality_scores
df$centrality_scores <- cent_score_df2[ match(df[['id']], cent_score_df2[['V.g2..name']] ) , 'degree.cent_close2.res']
df$centrality_scores
However, it seems each time I try to merge my data with the original dataframe it can only attach half the data. I can never attach both dataframes. Does anyone have a better method that works for re-attaching data? If there are faster and cleaner ways of doing this I would greatly appreciate the input
The problem with this line of code is that it assigns to the whole centrality_scores column: ids that do not appear in cent_score_df1 get NA, and the later assignment from cent_score_df2 then overwrites the scores from the first one.
df$centrality_scores <- cent_score_df1[ match(df[['id']], cent_score_df1[['V.g1..name']] ) , 'degree_cent_close_1.res']
What you intended was to update only the rows that have a match (repeat the last two lines for cent_score_df2):
df$centrality_scores<-NA
idx<-match(df$id, cent_score_df1$V.g1..name)
df$centrality_scores[!is.na(idx)]<-cent_score_df1$degree_cent_close_1.res[na.omit(idx)]
Another way to solve this is to standardize the column names of your metric data frames and then use the merge function to add the results back to your original data frame.
names(cent_score_df1)<-c("centrality_scores", "id")
names(cent_score_df2)<-c("centrality_scores", "id")
cent_score<-rbind(cent_score_df1, cent_score_df2)
merge(df, cent_score, by.x="id", by.y="id")
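With about 17 metric data frames, the same idea can be sketched by collecting them in a list and binding them once. The toy values below are hypothetical and no igraph call is involved. A match() lookup is used instead of merge() so that the row order of df is kept and ids without a score get NA.

```r
# Hypothetical toy data standing in for df and the metric data frames
df <- data.frame(id  = c("B7", "E6", "D3", "C9", "C2", "Z1"),
                 age = c(42L, 45L, 83L, 59L, 49L, 23L))
cent_score_df1 <- data.frame(centrality_scores = c(1, 1), id = c("B7", "D3"))
cent_score_df2 <- data.frame(centrality_scores = 2,       id = "C2")

# Put every metric data frame in one list, then stack them in one call
score_list <- list(cent_score_df1, cent_score_df2)
cent_score <- do.call(rbind, score_list)

# Look up each id once; unmatched ids become NA
df$centrality_scores <- cent_score$centrality_scores[match(df$id, cent_score$id)]
df
```

If merge() is preferred, all.x = TRUE keeps the rows of df that have no score.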
I would like to match two data frames based on a certain column. My data frames are shown below
df <- structure(list(Read = structure(1:3, .Label = c("CC", "CG", "GC"
), class = "factor"), index = c(6L, 7L, 10L)), .Names = c("Read",
"index"), row.names = c(NA, -3L), class = "data.frame")
df1 <- structure(list(Ref_base = structure(c(1L, 6L, 4L, 2L, 3L, 4L,
3L, 5L), .Label = c("AT", "CC", "CG", "GC", "GT", "TG"), class = "factor"),
index = c(4L, 15L, 10L, 6L, 7L, 10L, 7L, 12L)), .Names = c("Ref_base",
"index"), row.names = c(NA, -8L), class = "data.frame")
I use match to find the match between the two data frames
match(df$index,df1$index)
and it gives me the correct result 4 5 3 as the indexes of the matches. But I would like to lock down position 4, which is the index of the first match, and perform each later match only after position 4 (or whatever the index of the previous match is). I don't want the search to look at positions before the first match. For example, I would like the indexes returned as 4, 5, 6, including repetitions if any.
The first solution is basically not more than a loop. It loops through all search elements from df$index and returns the match indices in tmp. The variable search_start is used to let the next search begin from the most recent position. Since search_start was defined outside of the anonymous function in sapply you have to use <<- instead of = or <- to access it. There is also some code for handling NAs (this was missing in the first version of my answer).
match_sapply=function(a,b) {
  search_start=1
  tmp2=sapply(a,function(x) {
    # search only the part of b from the current start position onward
    tmp=match(x,b[search_start:length(b)])
    search_start<<-search_start+ifelse(is.na(tmp),0,tmp)
    tmp
  })
  #the following line updates all non-NA elements of tmp2 with its cumulative sum
  `[<-`(tmp2,!is.na(tmp2),cumsum(tmp2[!is.na(tmp2)]))
}
match_sapply(c(50,df$index,20),df1$index)
#[1] NA 4 5 6 NA
And another version using Recall. This is a recursive approach. Recall calls the function from which it was called (in our case match_recall) again, but you can provide different arguments. The arguments of match_recall are: x, the search terms; y, the target vector; n, the recursion level (which also selects the element of x to search for); and si, the start index (same as search_start in the previous solution). Again, there is some code that handles NAs.
match_recall=function(x,y,n=1,si=1) {
  tmp=match(x[n],y[si:length(y)])
  tmp1=tmp
  if (is.na(tmp1)) tmp1=0
  if (length(x)==n) {
    return(tmp)
  } else {
    c(tmp,tmp1+Recall(x,y,n+1,si+tmp1))
  }
}
match_recall(c(50,df$index,20),df1$index)
#[1] NA 4 5 6 NA
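For comparison, the same sequential search can be sketched as a plain for loop, which needs neither <<- nor recursion. The function name match_from is made up for illustration, and the input vectors are the index columns from the question, typed out by hand.

```r
match_from <- function(a, b) {
  out   <- rep(NA_integer_, length(a))
  start <- 1L
  for (i in seq_along(a)) {
    if (start > length(b)) break        # nothing left to search
    hit <- match(a[i], b[start:length(b)])
    if (!is.na(hit)) {
      out[i] <- start + hit - 1L        # position in the full vector b
      start  <- out[i] + 1L             # next search begins after this match
    }
  }
  out
}

res <- match_from(c(50, 6, 7, 10, 20), c(4, 15, 10, 6, 7, 10, 7, 12))
res
#[1] NA  4  5  6 NA
```

The explicit check on start also covers the case where a match lands on the last element of b, after which nothing is left to search.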