Xts format of data.frame - r

I have a dataframe with 2 columns in it. The first column contains POSIXct data, and the second contains some sample value integers. I cannot for the life of me figure out how to convert this dataframe to xts.
My data:
structure(list(SampleDateTime = list(1422835200000, 1423353600000,
1423958400000, 1433030400000, 1433635200000, 1434326400000,
1434844800000, 1444521600000, 1445731200000, 1453593600000,
1.455408e+12, 1420934400000, 1424563200000, 1425772800000,
1426982400000, 1430006400000, 1.431216e+12, 1.440288e+12,
1448150400000, 1460851200000), SampleValue = list(9L, 3L,
2L, 1733L, 19L, 6L, 1L, 17L, 7L, 23L, 147L, 1L, 17L, 1L,
1L, 19L, 1L, 11L, 2L, 91L), Dttm = structure(c(1107216000,
1107734400, 1108339200, 1117411200, 1118016000, 1118707200, 1119225600,
1128902400, 1130112000, 1137974400, 1139788800, 1105315200, 1108944000,
1110153600, 1111363200, 1114387200, 1115596800, 1124668800, 1132531200,
1145232000), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c("feature$attributes",
"feature$attributes1", "feature$attributes2", "feature$attributes3",
"feature$attributes4", "feature$attributes5", "feature$attributes6",
"feature$attributes7", "feature$attributes8", "feature$attributes9",
"feature$attributes10", "feature$attributes11", "feature$attributes12",
"feature$attributes13", "feature$attributes14", "feature$attributes15",
"feature$attributes16", "feature$attributes17", "feature$attributes18",
"feature$attributes19"), class = "data.frame")
I use this to try and make the conversion:
q <- as.xts(t, order.by = t$Dttm, dateFormat="POSIXct", frequency=NULL, RECLASS=FALSE)
I get the following errors:
Error in coredata.xts(x) : currently unsupported data type
Any help appreciated. I can't figure out why it says the data frame is unsupported when the documentation says data frames are supported.

As pointed out in the comments, your data is in an unusual format: the first two columns of the data frame are themselves lists rather than numeric columns. So first unlist each column (this leaves columns that are not lists unchanged), giving dat.u, and then convert. We assume dat is the data shown via dput in the question.
library(xts)  # also loads zoo, which provides read.zoo
dat.u <- replace(dat, TRUE, lapply(dat, unlist))
z <- read.zoo(dat.u, index.column = "Dttm")
x <- as.xts(z)
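To see what the unlisting step changes, here is a minimal sketch on a two-row stand-in for the question's data. The stand-in values and the guarded call to xts::xts() at the end are assumptions for illustration, not part of the original answer.

```r
# Two-row stand-in for the question's data (values are illustrative)
dat <- data.frame(Dttm = as.POSIXct(c("2005-02-01", "2005-02-07"), tz = "UTC"))
dat$SampleDateTime <- list(1422835200000, 1423353600000)  # list column
dat$SampleValue <- list(9L, 3L)                           # list column

# The list columns are what the xts conversion cannot handle
list_cols <- sapply(dat, is.list)

# Unlist every column; columns that are not lists come back unchanged
dat.u <- replace(dat, TRUE, lapply(dat, unlist))
still_list <- sapply(dat.u, is.list)

# Optional: build the xts object directly if the xts package is installed
if (requireNamespace("xts", quietly = TRUE)) {
  x <- xts::xts(dat.u$SampleValue, order.by = dat.u$Dttm)
}
```

After unlisting, SampleValue is a plain integer vector and Dttm keeps its POSIXct class, so either read.zoo() or xts() can use Dttm as the index.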

How to separate a dataframe based on specific string in column name [duplicate]

I have a large data set that I have not been able to split into two sets.
df<- structure(list(name = structure(1:3, .Label = c("a", "b", "c"
), class = "factor"), X3C_AALI_01A = c(651L, 2L, 1877L), X3C_AALJ_01B = c(419L,
2L, 1825L), X3C_AALK_01A = c(1310L, 52L, 1286L), X4H_AAAK_11B = c(2978L,
4L, 1389L), X5L_AAT0_01B = c(2576L, 15L, 1441L), X5L_AAT1_01A = c(2886L,
5L, 921L), X5T_A9QA_03A = c(929L, 3L, 935L), A1_A0SI_10A = c(1578L,
1L, 2217L), A1_A0SK_07C = c(3003L, 6L, 2984L), A1_A0SO_01A = c(6413L,
0L, 3577L), A1_A0SP_05B = c(5157L, 5L, 4596L), A2_A04P_01A = c(4283L,
6L, 2508L), X5L_AAh1_10A = c(2886L, 5L, 921L), X5T_A0QA_03A = c(929L,
3L, 935L), A1_A0Sm_10A = c(1578L, 1L, 2217L), A1_ArSK_01A = c(3003L,
6L, 2984L), A1_AfSO_01A = c(6413L, 0L, 3577L), A1_AuSP_05A = c(5157L,
5L, 4596L), A2_Ap4P_11A = c(4283L, 6L, 2508L)), class = "data.frame", row.names = c(NA,
-3L))
Basically, I want to split the data based on the end of the column name. For example, in the data above the second column is named 3C_AALI_01A, and I want to generate two data sets based on the _01A part.
Columns whose number is 01 to 09 should go in one data frame, and columns whose number is 10 or higher should go in the second data frame.
In the example data above, the columns with the following names should be in one data frame:
3C_AALI_01A
3C_AALJ_01B
3C_AALK_01A
5L_AAT0_01B
5L_AAT1_01A
5T_A9QA_03A
A1_A0SK_07C
A1_A0SO_01A
A1_A0SP_05B
A2_A04P_01A
5T_A0QA_03A
A1_ArSK_01A
A1_AfSO_01A
A1_AuSP_05A
and the columns with the following names should be in another data frame
4H_AAAK_11B
A1_A0SI_10A
5L_AAh1_10A
A1_A0Sm_10A
A2_Ap4P_11A
low <- grepl('0[1-9].$', colnames(df))  # TRUE for codes 01-09 followed by one character
df1 <- df[, low]
df2 <- df[, !low]  # everything else, including the name column
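An alternative sketch extracts the two-digit code as a number, so the cut-off (below 10 versus 10 and above) is explicit rather than encoded in a regular expression. The short vector of column names is a stand-in for colnames(df), and the names pattern, is_sample, code, low_cols and high_cols are made up for this illustration.

```r
# Stand-in for colnames(df)
cols <- c("name", "X3C_AALI_01A", "X4H_AAAK_11B", "A1_A0SI_10A", "A1_A0SK_07C")

# Sample columns end in an underscore, two digits and one letter
pattern <- "_([0-9]{2})[A-Za-z]$"
is_sample <- grepl(pattern, cols)

# Pull out the two digits and convert them to integers
code <- rep(NA_integer_, length(cols))
code[is_sample] <- as.integer(sub(paste0(".*", pattern), "\\1", cols[is_sample]))

low_cols  <- cols[is_sample & code < 10]   # codes 01 to 09
high_cols <- cols[is_sample & code >= 10]  # codes 10 and above
```

With the real data, df[, low_cols, drop = FALSE] and df[, high_cols, drop = FALSE] would then give the two data frames, and the name column stays out of both unless it is added back explicitly.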
You could use a tidyr::separate(..., sep = -1) approach,
which uses negative string indexing, which is what you really want here.
Also, your data frame is transposed: it would be more conventional to have a single column holding the sample names and numeric columns a, b, c, like t(df) but without the unwanted coercion to string.
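Base R has no negative indices for strings, but counting from the end can be sketched with nchar() and substr(). The example names below are taken from the question; the variable names are chosen for illustration.

```r
x <- c("X3C_AALI_01A", "X4H_AAAK_11B")
n <- nchar(x)

last_char <- substr(x, n, n)          # final character of each name
code      <- substr(x, n - 2, n - 1)  # the two digits before it
stem      <- substr(x, 1, n - 1)      # everything except the final character
```

substr() is vectorised over its start and stop arguments, so the same three calls work on the whole vector of column names at once.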

R Index error while trying to append multiple dataframes into one

I have a massive dataframe, and I'm trying to combine scores I calculated from multiple dataframes (about 17 of them) into this one dataframe. I need to do this process 12 different times. This is an example of the dataframe I have:
df=structure(list(id = structure(c(2L, 7L, 5L, 4L, 3L, 1L, 6L,
8L), .Label = c("B12", "B7", "C2", "C9", "D3", "E2", "E6", "R4"
), class = "factor"), age = c(42L, 45L, 83L, 59L, 49L, 46L, 52L,
23L)), class = "data.frame", row.names = c(NA, -8L))
So I need to calculate network metrics using the igraph package. Here are 2 matrices I have with different people in them
net_mat1=structure(c("B7", "E6", "D3", "C9"), .Dim = c(2L, 2L), .Dimnames = list(
NULL, c("target", "partner")))
net_mat2=structure(c("C2", "B12", "E2", "R4"), .Dim = c(2L, 2L), .Dimnames = list(
NULL, c("target", "partner")))
Here is what I'm calculating
library(igraph)
g1=graph_from_edgelist(net_mat1)
g2=graph_from_edgelist(net_mat2)
degree_cent_close_1=centr_degree(g1, mode = "all")
degree_cent_close_1 #create object that contains metrics
degree.cent_close2=centr_degree(g2, mode = "all")
degree.cent_close2 #create another object that contains metrics
I then create dataframes that contain the metrics I calculated
cent_score_df1=data.frame(degree_cent_close_1$res, V(g1)$name)
cent_score_df1
cent_score_df2=data.frame(degree.cent_close2$res, V(g2)$name)
cent_score_df2
I then try to match and index the values of these metrics back into the df dataframe by doing this:
df$centrality_scores <- cent_score_df1[ match(df[['id']], cent_score_df1[['V.g1..name']] ) , 'degree_cent_close_1.res']
df$centrality_scores
df$centrality_scores <- cent_score_df2[ match(df[['id']], cent_score_df2[['V.g2..name']] ) , 'degree.cent_close2.res']
df$centrality_scores
However, it seems each time I try to merge my data with the original dataframe it can only attach half the data. I can never attach both dataframes. Does anyone have a better method that works for re-attaching data? If there are faster and cleaner ways of doing this I would greatly appreciate the input
The problem with this line of code is that it does not select which rows of the original data.frame to update. It assigns the whole column, so ids without a match become NA and the second assignment overwrites the scores from the first.
df$centrality_scores <- cent_score_df1[ match(df[['id']], cent_score_df1[['V.g1..name']] ) , 'degree_cent_close_1.res']
What you intended was to update only the matching rows:
df$centrality_scores<-NA
idx <- match(cent_score_df1$V.g1..name, df$id)  # row of df for each scored vertex
df$centrality_scores[idx] <- cent_score_df1$degree_cent_close_1.res
Another way to solve this is to standardize the column names of your metric data frames and then use the merge function to add the results back to your original data frame.
names(cent_score_df1)<-c("centrality_scores", "id")
names(cent_score_df2)<-c("centrality_scores", "id")
cent_score<-rbind(cent_score_df1, cent_score_df2)
merge(df, cent_score, by.x="id", by.y="id")
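The two approaches can be compared on a small stand-in that needs no igraph. In this sketch the score values and the names scores1, scores2, all_scores and merged are invented for illustration; only the id values come from the question.

```r
# Stand-in for the original data frame
df <- data.frame(id = c("B7", "E6", "D3", "C9", "C2", "B12"),
                 age = c(42, 45, 83, 59, 49, 46),
                 stringsAsFactors = FALSE)

# Stand-ins for two metric data frames with standardized column names
scores1 <- data.frame(centrality_scores = c(1, 2), id = c("B7", "D3"))
scores2 <- data.frame(centrality_scores = c(3, 4), id = c("C2", "B12"))
all_scores <- rbind(scores1, scores2)

# merge(): all.x = TRUE keeps rows of df without a score, but sorts by id
merged <- merge(df, all_scores, by = "id", all.x = TRUE)

# match(): one lookup against the stacked scores keeps the row order of df
df$centrality_scores <- all_scores$centrality_scores[match(df$id, all_scores$id)]
```

Stacking the metric data frames first means the column is assigned once, so nothing is overwritten. Without all.x = TRUE, merge() would drop the ids that have no score.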

arranging strings from one data frame based on another one

I have a data frame like this one
df1<- structure(list(V1 = structure(c(8L, 4L, 5L, 7L, 6L, 3L, 9L, 1L,
2L), .Label = c("A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4", "A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920",
"C1P641;C1P640;A0A061AD21;G5EEV6", "O16276", "O16520-2", "O17323-2",
"O17395", "O17403", "Q22501;A0A061AE05"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-9L))
My second data frame looks like this
df2<- structure(list(From = structure(c(12L, 10L, 11L, 8L, 7L, 1L,
9L, 15L, 2L, 5L, 13L, 3L, 16L, 6L, 4L, 14L), .Label = c("A0A061AD21",
"A0A061AE05", "A0A061AJ82", "A0A061AJK8", "A0A061AKW6", "A0A061AL89",
"C1P640", "C1P641", "G5EEV6", "O16276", "O17395", "O17403", "Q19219",
"Q21920", "Q22501", "Q7JLR4"), class = "factor"), To = structure(c(4L,
8L, 1L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 3L, 3L, 7L), .Label = c("aat-3",
"CELE_F08G5.3", "CELE_R11A8.7", "cpsf-2", "epi-1", "pps-1", "R11A8.7",
"ugt-61"), class = "factor")), .Names = c("From", "To"), class = "data.frame", row.names = c(NA,
-16L))
df2 is derived from df1, but some information has been added and some removed. I want to reconstruct df2 so that it follows the layout of df1, and arrange the column named To accordingly.
So the output should look like this
From To
O17403 cpsf-2
O16276 ugt-61
O16520-2 -
O17395 aat-3
O17323-2 -
C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
Q22501;A0A061AE05 pps-1
A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
That is, O17403 is in df2 and was the only string in its row of df1, so it stays the same. O16276 was also the only string in its row of df1, so it stays the same too.
O16520-2 is in df1 but not in df2, so the To column gets a hyphen.
The same goes for the rest. C1P641;C1P640;A0A061AD21;G5EEV6 are all in the same row of df1 and their To is the same, so we keep them together as in df1 and add a single epi-1.
Probably the best approach is to use df1 as a template and then fill in To: rows with entries in df2 get their To values, and rows without get only a hyphen.
It is very complicated, and I could not even think of how to do it. I will appreciate any help.
To solve this I split the semicolon-delimited strings and created a nested for-for-if-if loop.
Here's the logic behind the loop, which runs against the data.frame of split strings (tmp):
Fix the data classes (i.e. change factor to character to avoid conflicting level sets) and append a temporary To column to tmp.
For each column and row of tmp, start by checking whether the cell contains a valid string for matching and has a matched value in df2$To; if not, go to the next iteration.
If it does, look at the matching value in To from df2 and check whether we already have that value in tmp$To (if so, go to the next iteration).
If there's a new matched value in df2$To, put it in the corresponding cell of tmp$To, appending it after any preceding matches with a semicolon if it is not the first match for that row.
df1$V1 <- as.character(df1$V1)
df2$From <- as.character(df2$From)
df2$To <- as.character(df2$To)
library(stringr)
tmp <- as.data.frame(str_split_fixed(df1$V1, ";", n = 5), stringsAsFactors = FALSE)
tmp$To <- as.character(NA)
for (j in 1:nrow(tmp)) {
  for (i in 1:ncol(tmp)) {
    if (length(df2$To[df2$From == tmp[j, i]]) == 0 | is.null(tmp[j, i])) {
      next
    } else if (length(df2$To[df2$From == tmp[j, i]]) == 1 & !is.na(tmp[j, i])) {
      if (is.na(tmp$To[j]) | tmp$To[j] == df2$To[df2$From == tmp[j, i]]) {
        tmp$To[j] <- df2$To[df2$From == tmp[j, i]]
      } else {
        tmp$To[j] <- paste(tmp$To[j], ";", df2$To[df2$From == tmp[j, i]], sep = "")
      }
    } else {
      next
    }
  }
}
df1 <- data.frame(From=df1$V1, To=tmp$To)
df1
From To
1 O17403 cpsf-2
2 O16276 ugt-61
3 O16520-2 <NA>
4 O17395 aat-3
5 O17323-2 <NA>
6 C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
7 Q22501;A0A061AE05 pps-1
8 A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
One way of doing this is to use the splitstackshape package (use cSplit). I converted the factors to character strings to simplify (and get rid of warnings).
library(dplyr)
library(data.table) # cSplit from 'splitstackshape' returns a 'data.table'.
library(splitstackshape)
### Remove the factors for convenience of manipulation
df1 <- df1 %>% mutate(From = as.character(V1))
df2 <- df2 %>% mutate(From = as.character(From), To = as.character(To))
### 'cSplit' will split on ';' and create a new row for each item. The
### original 'From' column is kept around as cSplit removes the split column.
### 'rn' (row number) is used for ordering later.
cSplit(df1 %>% mutate(rn = row_number(), From_temp = From),
       "From_temp", sep = ";", direction = "long", drop = FALSE, type.convert = FALSE) %>%
  left_join(df2, by = c(From_temp = 'From')) %>%  # Join to 'df2' to get the 'To' column
  group_by(From, rn) %>%                          # Group by original 'From' column.
  summarise(To = paste(sort(unique(na.omit(To))), collapse = ';'),  # Create 'To' by joining 'To' values
            To = ifelse(To == '', '-', To)) %>%   # Set empty values to '-'
  ungroup %>%
  arrange(rn) %>%                                 # Sort by original row number and
  select(-rn)                                     # remove 'rn' column.
## From To
## <chr> <chr>
## 1 O17403 cpsf-2
## 2 O16276 ugt-61
## 3 O16520-2 -
## 4 O17395 aat-3
## 5 O17323-2 -
## 6 C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
## 7 Q22501;A0A061AE05 pps-1
## 8 A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
## 9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
There may be a cleaner way to do this with dplyr that doesn't require splitstackshape.
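For comparison, the same template-and-lookup idea can be sketched in base R with strsplit() and a named vector. This is a shortened stand-in for the question's data, and the names lookup, to and result are chosen for illustration.

```r
# Shortened stand-ins for df1 and df2 from the question
df1 <- data.frame(V1 = c("O17403", "O16520-2", "C1P641;C1P640;G5EEV6",
                         "A0A061AL89;Q21920"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(From = c("O17403", "C1P641", "C1P640", "G5EEV6",
                           "A0A061AL89", "Q21920"),
                  To = c("cpsf-2", "epi-1", "epi-1", "epi-1",
                         "CELE_R11A8.7", "R11A8.7"),
                  stringsAsFactors = FALSE)

# Named vector: look up To by From
lookup <- setNames(df2$To, df2$From)

# For each row of df1, collect the distinct To values of its ids
to <- vapply(strsplit(df1$V1, ";", fixed = TRUE), function(ids) {
  hits <- lookup[ids]                 # NA for ids that are not in df2
  hits <- unique(hits[!is.na(hits)])  # drop misses and repeated values
  if (length(hits) == 0) "-" else paste(hits, collapse = ";")
}, character(1))

result <- data.frame(From = df1$V1, To = to, stringsAsFactors = FALSE)
```

Unlike the dplyr version above, this sketch keeps the To values in the order the ids appear in the row rather than sorting them.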

Constructing All Possible Pairs within Groups

I have a large amount of graph data in the following form. Suppose a person has multiple interests.
person,interest
1,1
1,2
1,3
2,1
2,5
2,2
3,2
3,5
...
I want to construct all pairs of interests for each user. I would like to convert this into an edgelist like the following. I want the data in this format so that I can convert it into an adjacency matrix for graphing etc.
person,x_interest,y_interest
1,1,2
1,1,3
1,2,3
2,1,5
2,1,2
2,5,2
3,2,5
There is one solution here: Pairs of Observations within Groups but it works only for small datasets as the call to table wants to generate more than 2^31 elements. Is there another way that I can do this without having to rely on table?
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'person', take the unique pairwise combinations of 'interest' to create two columns ('x_interest' and 'y_interest').
library(data.table)
setDT(df1)[, {
  tmp <- combn(unique(interest), 2)
  list(x_interest = tmp[c(TRUE, FALSE)], y_interest = tmp[c(FALSE, TRUE)])
}, by = person]
NOTE: To speed up, combnPrim from library(gRbase) could be used in place of combn.
data
df1 <- structure(list(person = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
interest = c(1L,
2L, 3L, 1L, 5L, 2L, 2L, 5L)), .Names = c("person", "interest"
), class = "data.frame", row.names = c(NA, -8L))
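The grouping logic can also be sketched in base R, which may help when checking results on a small sample. This is an illustration of the logic, not a claim about speed on large data, and the names groups, pieces and edges are invented here. It takes combinations of positions rather than of values, because combn() treats a single number n as the sequence 1:n.

```r
df1 <- data.frame(person   = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
                  interest = c(1L, 2L, 3L, 1L, 5L, 2L, 2L, 5L))

# One vector of interests per person
groups <- split(df1$interest, df1$person)

pieces <- Map(function(p, x) {
  x <- unique(x)
  if (length(x) < 2) return(NULL)  # a single interest yields no pairs
  idx <- combn(length(x), 2)       # pairs of positions, one pair per column
  data.frame(person = as.integer(p),
             x_interest = x[idx[1, ]],
             y_interest = x[idx[2, ]])
}, names(groups), groups)

# Stack the per-person pieces, skipping people without pairs
edges <- do.call(rbind, Filter(Negate(is.null), pieces))
rownames(edges) <- NULL
```

On the sample data this gives the seven rows listed in the question, in the same order.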

How to split/subset a dataframe into multiple dataframes in R

I have looked through the web and Stack Overflow and was not able to find a solution to my problem. I don't know whether dplyr or a loop would be more efficient.
Below is an example of a dataframe (my own datasets have more than 10,000 rows) that I would like to split in three based on column B (<250), either as a list with three objects or as three individual dataframes. Then, for each new dataframe, I would like to, for example, count the number of points (the length of the dataframe) and compute the duration (column Time is in seconds). Any suggestion would be really appreciated.
thank you
Martin
dput(mydata)
structure(list(Time = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 0L,
11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L), A = c(4L, 5L, 6L, 7L,
3L, 7L, 8L, 10L, 11L, 8L, 10L, 12L, 14L, 6L, 14L, 16L, 20L, 22L
), B = c(100.25, 150.75, 200, 1000.56, 2000.1, 100, 150, 50,
25.2, 102.25, 152.75, 202, 1002.56, 2002.1, 102, 152, 52, 27.2
)), .Names = c("Time", "A", "B"), class = "data.frame", row.names = c(NA,
-18L))
Grab IRanges from Bioconductor (df here is the data frame called mydata in the question):
library(IRanges)
runs <- slice(Rle(df$B), upper=250)
This is an RleViews object, with a view (range) for every run under 250. You can extract the width of the views (the number of points that would be in each data frame):
width(runs)
You can split the data frame into a list like this:
blocks <- extractList(df, ranges(runs))
Note that blocks is now a formal SplitDataFrameList.
To compute the duration, you can extract the Time column as an IntegerList and compute the difference between the last and first element of every list element:
time <- blocks[,"Time"]
ptail(time, 1) - phead(time, 1)
This happens without actually forming separate list elements (the list is lazily managed) and so should be fast.
It's not clear how your specifications line up with your expected output. Here are two different methods of splitting:
# Gives three groups
split( mydata[mydata$B <250, ] , (1:nrow(mydata[mydata$B <250, ]))%% 3)
# Gives groups of size three
split( mydata[mydata$B <250, ] , (1:nrow(mydata[mydata$B <250, ]))%/% 3)
This shows how to count the numbers of rows from the first method:
> three <- split( mydata[mydata$B <250, ] , (1:nrow(mydata[mydata$B <250, ]))%% 3)
> lapply(three, nrow)
$`0`
[1] 4
$`1`
[1] 5
$`2`
[1] 5
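If the three data frames are meant to be the three consecutive runs where B stays below 250, a base R sketch could label the runs with cumsum() and then split. This reading of the question is an assumption, and so is the simplified Time column (1 to 18) used in the stand-in data; the names below, run_id, blocks, n_points and duration are chosen for illustration.

```r
# Stand-in for mydata: B as in the question, Time simplified to 1:18
mydata <- data.frame(
  Time = 1:18,
  B = c(100.25, 150.75, 200, 1000.56, 2000.1, 100, 150, 50, 25.2,
        102.25, 152.75, 202, 1002.56, 2002.1, 102, 152, 52, 27.2)
)

below <- mydata$B < 250

# Start a new run id every time `below` flips between TRUE and FALSE
run_id <- cumsum(c(TRUE, diff(below) != 0))

# Keep only the rows below the threshold, one data frame per run
blocks <- split(mydata[below, ], run_id[below])

n_points <- sapply(blocks, nrow)
duration <- sapply(blocks, function(b) max(b$Time) - min(b$Time))
```

blocks is a plain list of data frames, so lapply() or sapply() can compute any further per-run summary.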
