R Index error while trying to append multiple dataframes into one

I have a large dataframe, and I'm trying to add scores that I calculated from multiple other dataframes (about 17) to it. I need to repeat this process 12 times. This is an example of the dataframe I have:
df=structure(list(ï..id = structure(c(2L, 7L, 5L, 4L, 3L, 1L, 6L,
8L), .Label = c("B12", "B7", "C2", "C9", "D3", "E2", "E6", "R4"
), class = "factor"), age = c(42L, 45L, 83L, 59L, 49L, 46L, 52L,
23L)), class = "data.frame", row.names = c(NA, -8L))
So I need to calculate network metrics using the igraph package. Here are 2 matrices I have with different people in them
net_mat1=structure(c("B7", "E6", "D3", "C9"), .Dim = c(2L, 2L), .Dimnames = list(
NULL, c("ï..target", "partner")))
net_mat2=structure(c("C2", "B12", "E2", "R4"), .Dim = c(2L, 2L), .Dimnames = list(
NULL, c("ï..target", "partner")))
Here is what I'm calculating
library(igraph)
g1=graph_from_edgelist(net_mat1)
g2=graph_from_edgelist(net_mat2)
degree_cent_close_1=centr_degree(g1, mode = "all")
degree_cent_close_1 #create object that contains metrics
degree.cent_close2=centr_degree(g2, mode = "all")
degree.cent_close2 #create another object that contains metrics
I then create dataframes that contain the metrics I calculated
cent_score_df1=data.frame(degree_cent_close_1$res, V(g1)$name)
cent_score_df1
cent_score_df2=data.frame(degree.cent_close2$res, V(g2)$name)
cent_score_df2
I then try to match and index the values of these metrics back into the df dataframe like this:
df$centrality_scores <- cent_score_df1[ match(df[['id']], cent_score_df1[['V.g1..name']] ) , 'degree_cent_close_1.res']
df$centrality_scores
df$centrality_scores <- cent_score_df2[ match(df[['id']], cent_score_df2[['V.g2..name']] ) , 'degree.cent_close2.res']
df$centrality_scores
However, it seems each time I try to merge my data with the original dataframe it can only attach half the data. I can never attach both dataframes. Does anyone have a better method that works for re-attaching data? If there are faster and cleaner ways of doing this I would greatly appreciate the input

The problem with this line of code is that it assigns to the whole centrality_scores column instead of only the rows of the original data.frame that have a match, so each new assignment overwrites the previous one.
df$centrality_scores <- cent_score_df1[ match(df[['id']], cent_score_df1[['V.g1..name']] ) , 'degree_cent_close_1.res']
What you intended was to do this:
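df$centrality_scores <- NA
idx <- match(df$id, cent_score_df1$V.g1..name) # row of each id in the score table, NA if absent
df$centrality_scores[!is.na(idx)] <- cent_score_df1$degree_cent_close_1.res[idx[!is.na(idx)]]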
Another way to solve this is to standardize the column names of your metric data frames and then use the merge function to add the results back to your original data frame.
names(cent_score_df1)<-c("centrality_scores", "id")
names(cent_score_df2)<-c("centrality_scores", "id")
cent_score<-rbind(cent_score_df1, cent_score_df2)
merge(df, cent_score, by.x="id", by.y="id")
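With about 17 score tables, the rbind step above might be easier to manage by keeping the tables in a list. Here is a minimal sketch of that idea; scores1, scores2 and people are made-up stand-ins rather than the data from the question, and all.x = TRUE is added on the assumption that you want to keep people who have no score.

```r
# Hypothetical stand-ins for the per-network score tables and the main table
scores1 <- data.frame(centrality_scores = c(1, 1), id = c("B7", "D3"))
scores2 <- data.frame(centrality_scores = c(2, 2), id = c("C2", "E2"))
people  <- data.frame(id  = c("B7", "D3", "C2", "E2", "R4"),
                      age = c(42, 83, 49, 46, 23))

# Stack every score table in one step, however many there are
all_scores <- do.call(rbind, list(scores1, scores2))

# all.x = TRUE keeps people without a score (they get NA)
combined <- merge(people, all_scores, by = "id", all.x = TRUE)
combined
```

Here R4 appears in no score table, so it stays in the result with NA in centrality_scores instead of being dropped.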

Related

Xts format of data.frame

I have a dataframe with 2 columns in it. The first column contains POSIXct data, and the second contains some sample value integers. I cannot for the life of me figure out how to convert this dataframe to xts.
My data:
structure(list(SampleDateTime = list(1422835200000, 1423353600000,
1423958400000, 1433030400000, 1433635200000, 1434326400000,
1434844800000, 1444521600000, 1445731200000, 1453593600000,
1.455408e+12, 1420934400000, 1424563200000, 1425772800000,
1426982400000, 1430006400000, 1.431216e+12, 1.440288e+12,
1448150400000, 1460851200000), SampleValue = list(9L, 3L,
2L, 1733L, 19L, 6L, 1L, 17L, 7L, 23L, 147L, 1L, 17L, 1L,
1L, 19L, 1L, 11L, 2L, 91L), Dttm = structure(c(1107216000,
1107734400, 1108339200, 1117411200, 1118016000, 1118707200, 1119225600,
1128902400, 1130112000, 1137974400, 1139788800, 1105315200, 1108944000,
1110153600, 1111363200, 1114387200, 1115596800, 1124668800, 1132531200,
1145232000), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c("feature$attributes",
"feature$attributes1", "feature$attributes2", "feature$attributes3",
"feature$attributes4", "feature$attributes5", "feature$attributes6",
"feature$attributes7", "feature$attributes8", "feature$attributes9",
"feature$attributes10", "feature$attributes11", "feature$attributes12",
"feature$attributes13", "feature$attributes14", "feature$attributes15",
"feature$attributes16", "feature$attributes17", "feature$attributes18",
"feature$attributes19"), class = "data.frame")
I use this to try and make the conversion:
q <- as.xts(t, order.by = t$Dttm, dateFormat="POSIXct", frequency=NULL, RECLASS=FALSE)
I get the following errors:
Error in coredata.xts(x) : currently unsupported data type
Any help appreciated. I can't figure out why it says the data frame is unsupported when the documentation says data frames are supported.
As pointed out in the comments, your data is in an unusual format: the first two columns of the data frame are themselves lists rather than numeric vectors. So first unlist each column (this does not change columns that are not lists), giving dat.u, and then convert. We assume dat is the data shown via dput in the question.
library(xts) # also loads zoo, which provides read.zoo
dat.u <- replace(dat, TRUE, lapply(dat, unlist))
z <- read.zoo(dat.u, index = "Dttm")
x <- as.xts(z)
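If you want to check which columns are causing the trouble before converting, something like the following base R sketch may help. dat_small is a made-up miniature of the data in the question, not the real thing, and the sketch only covers the unlisting step, not the xts conversion.

```r
# Hypothetical miniature of the question's data: two of the columns are lists
dat_small <- data.frame(id = 1:3)
dat_small$SampleValue    <- list(9L, 3L, 2L)
dat_small$SampleDateTime <- list(1422835200000, 1423353600000, 1423958400000)

# Which columns are lists rather than atomic vectors?
is_list_col <- sapply(dat_small, is.list)
is_list_col

# Flatten only those columns and leave the rest untouched
dat_small[is_list_col] <- lapply(dat_small[is_list_col], unlist)
str(dat_small)
```

After this step every column is an ordinary vector, which is the kind of input the conversion functions expect.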

How to separate a dataframe based on specific string in column name [duplicate]

This question already has answers here:
Split string by last two characters in R? (/negative string indices)
(5 answers)
Closed 3 years ago.
I have a huge data set that I cannot work out how to split into two sets:
df<- structure(list(name = structure(1:3, .Label = c("a", "b", "c"
), class = "factor"), X3C_AALI_01A = c(651L, 2L, 1877L), X3C_AALJ_01B = c(419L,
2L, 1825L), X3C_AALK_01A = c(1310L, 52L, 1286L), X4H_AAAK_11B = c(2978L,
4L, 1389L), X5L_AAT0_01B = c(2576L, 15L, 1441L), X5L_AAT1_01A = c(2886L,
5L, 921L), X5T_A9QA_03A = c(929L, 3L, 935L), A1_A0SI_10A = c(1578L,
1L, 2217L), A1_A0SK_07C = c(3003L, 6L, 2984L), A1_A0SO_01A = c(6413L,
0L, 3577L), A1_A0SP_05B = c(5157L, 5L, 4596L), A2_A04P_01A = c(4283L,
6L, 2508L), X5L_AAh1_10A = c(2886L, 5L, 921L), X5T_A0QA_03A = c(929L,
3L, 935L), A1_A0Sm_10A = c(1578L, 1L, 2217L), A1_ArSK_01A = c(3003L,
6L, 2984L), A1_AfSO_01A = c(6413L, 0L, 3577L), A1_AuSP_05A = c(5157L,
5L, 4596L), A2_Ap4P_11A = c(4283L, 6L, 2508L)), class = "data.frame", row.names = c(NA,
-3L))
Basically, I want to split the data based on the last characters of the column names. For example, the second column above is 3C_AALI_01A, and I want to generate two data sets based on the _01A part.
Columns with values from 01 to 09 should go in one data frame, and those with 10 or higher should go in a second data frame.
In the example data above, the columns with the following names should be in one data frame:
3C_AALI_01A
3C_AALJ_01B
3C_AALK_01A
5L_AAT0_01B
5L_AAT1_01A
5T_A9QA_03A
A1_A0SK_07C
A1_A0SO_01A
A1_A0SP_05B
A2_A04P_01A
5T_A0QA_03A
A1_ArSK_01A
A1_AfSO_01A
A1_AuSP_05A
and the columns with the following names should be in another data frame
4H_AAAK_11B
A1_A0SI_10A
5L_AAh1_10A
A1_A0Sm_10A
A2_Ap4P_11A
is_low <- grepl('0[1-9].$', colnames(df)) # TRUE for columns whose code is 01 to 09
df1 <- df[, is_low]
df2 <- df[, !is_low] # everything else, including the name column
You could use the tidyr::separate(..., sep=-1) approach, which uses negative string indexing, which is what you really want here.
Also, your dataframe is transposed. It would be more usual to have a single column holding the sample names and numeric columns a, b, c, like t(df) without the unwanted coercion to string.
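If the regular expression feels fragile, one alternative is to pull the two-digit code out of each column name and compare it as a number. This is only a sketch: counts is a made-up miniature of the table in the question, and it assumes every sample column ends in an underscore, two digits and one letter.

```r
# Hypothetical miniature of the question's table
counts <- data.frame(name = c("a", "b"),
                     X3C_AALI_01A = c(651L, 2L),
                     X4H_AAAK_11B = c(2978L, 4L),
                     A1_A0SK_07C  = c(3003L, 6L),
                     A1_A0SI_10A  = c(1578L, 1L))

sample_cols <- setdiff(names(counts), "name")

# Extract the two digits before the final letter, e.g. "01" from "..._01A"
code <- as.integer(sub("^.*_([0-9]{2})[A-Za-z]$", "\\1", sample_cols))

# Keep the name column in both halves
low  <- counts[, c("name", sample_cols[code < 10])]
high <- counts[, c("name", sample_cols[code >= 10])]
```

Comparing the code as an integer means the cut-off can be changed later without rewriting the pattern.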

Can I use %in% to search and match two columns?

I have a large dataframe and a vector of terms of interest that I want to pull out. For a previous project I was using:
a=data[data$rn %in% y, "Gene"]
to pull out information into a new vector. Now I have another job I'd like to do.
I have a large dataframe of 15 columns and >100000 rows. I want to search columns 3 and 9 for the content of the vector and return the matching rows as a new dataframe.
To make this extra annoying, the hit could be in v3 and not in v9, and vice versa.
Working example
I have stripped the dataframe down to 3 columns and a few rows.
data <- structure(list(Gene = structure(c(1L, 5L, 3L, 2L, 4L), .Label = c("ibp","leuA", "pLeuDn_02", "repA", "repA1"), class = "factor"), LocusTag = structure(c(1L,2L, 5L, 3L, 4L), .Label = c("pBPS1_01", "pBPS1_02", "pleuBTgp4","pleuBTgp5", "pLeuDn_02"), class = "factor"), hit = structure(c(2L,4L, 3L, 1L, 5L), .Label = c("2-isopropylmalate synthase", "Ibp protein","ORF1", "repA1 protein", "replication-associated protein"), class = "factor")), .Names = c("Gene","LocusTag", "hit"), row.names = c(NA, 5L), class = "data.frame")
y <- c("ibp", "orf1")
First of all, R is case sensitive, so your example will not pick up the third row, which I guess you want extracted. You would have to change your y to
y <- c("ibp", "ORF1")
From your example, I am not sure if this is really what you want to achieve, but R has the operator | for "or", so you could try something like:
new.data<-data[data$Gene %in% y|data$hit %in% y,]
If you only want to extract certain columns of your data set, you can specify them after the ",", e.g.:
new.data<-data[data$Gene %in% y|data$hit %in% y, c("LocusTag","Gene")]
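If you would rather not edit y by hand, you could lower-case both sides before comparing. A minimal sketch, where genes and terms are made-up stand-ins for data and y:

```r
# Hypothetical stand-ins for the question's data and search vector
genes <- data.frame(Gene = c("ibp", "repA1", "pLeuDn_02"),
                    hit  = c("Ibp protein", "repA1 protein", "ORF1"),
                    stringsAsFactors = FALSE)
terms <- c("ibp", "orf1")

# Lower-case both sides so that "orf1" matches "ORF1"
keep <- tolower(genes$Gene) %in% tolower(terms) |
        tolower(genes$hit)  %in% tolower(terms)
hits <- genes[keep, ]
hits
```

Note that %in% still tests whole values, so "ibp" does not match "Ibp protein"; the first row is kept because of its Gene column.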

Dynamic use of match function

I would like to match two data frames based on a certain column. My data frames are shown below:
df <- structure(list(Read = structure(1:3, .Label = c("CC", "CG", "GC"
), class = "factor"), index = c(6L, 7L, 10L)), .Names = c("Read",
"index"), row.names = c(NA, -3L), class = "data.frame")
df1 <- structure(list(Ref_base = structure(c(1L, 6L, 4L, 2L, 3L, 4L,
3L, 5L), .Label = c("AT", "CC", "CG", "GC", "GT", "TG"), class = "factor"),
index = c(4L, 15L, 10L, 6L, 7L, 10L, 7L, 12L)), .Names = c("Ref_base",
"index"), row.names = c(NA, -8L), class = "data.frame")
I use match to find the match between the two data frames
match(df$index,df1$index)
and it gives me the correct result, 4 5 3, as the indices of the matches. But I would like to lock down position 4, which is the index of the first match, and perform the next match after 4 (or whatever the first index is). I don't want to perform the search beyond the index of the first match. For example, I would like the returned indices to be 4, 5, 6, including repetitions if any.
The first solution is basically a loop. It goes through all search elements from df$index and returns the match indices in tmp. The variable search_start makes the next search begin from the most recent position. Since search_start is defined outside the anonymous function passed to sapply, you have to use <<- instead of = or <- to modify it. There is also some code for handling NAs (this was missing in the first version of my answer).
match_sapply=function(a,b) {
  search_start=1
  tmp2=sapply(a,function(x) {
    #search only the part of b after the previous match
    tmp=match(x,b[search_start:length(b)])
    search_start<<-search_start+ifelse(is.na(tmp),0,tmp)
    tmp
  })
  #the following line updates all non-NA elements of tmp2 with its cumulative sum
  `[<-`(tmp2,!is.na(tmp2),cumsum(tmp2[!is.na(tmp2)]))
}
match_sapply(c(50,df$index,20),df1$index)
#[1] NA 4 5 6 NA
And another version using Recall. This is a recursive approach. Recall calls the function from which it was called (in our case match_recall) again, but you can provide different arguments. The arguments of match_recall are: x the search terms, y the target vector, n the recursion level (which also selects a specific element of x), and si the start index (same as search_start in the previous solution). Again, there is some code that handles NAs.
match_recall=function(x,y,n=1,si=1) {
  tmp=match(x[n],y[si:length(y)])
  tmp1=tmp
  if (is.na(tmp1)) tmp1=0
  if (length(x)==n) {
    return(tmp)
  } else {
    c(tmp,tmp1+Recall(x,y,n+1,si+tmp1))
  }
}
match_recall(c(50,df$index,20),df1$index)
#[1] NA 4 5 6 NA
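For comparison, the same idea can be written as a plain for loop, which some may find easier to follow than <<- or recursion. This is just a sketch; match_forward is a name made up here, and the guard that stops once the target vector is exhausted is an addition of mine, not part of the two versions above.

```r
# Match each element of `a` in `b`, searching only after the previous hit
match_forward <- function(a, b) {
  out <- rep(NA_integer_, length(a))
  start <- 1L
  for (i in seq_along(a)) {
    if (start > length(b)) break            # nothing left to search
    hit <- match(a[i], b[start:length(b)])
    if (!is.na(hit)) {
      out[i] <- start + hit - 1L            # position in the full vector
      start <- out[i] + 1L                  # next search begins after this hit
    }
  }
  out
}

result <- match_forward(c(50, 6, 7, 10, 20), c(4, 15, 10, 6, 7, 10, 7, 12))
result
#[1] NA  4  5  6 NA
```

The input vectors are the values of c(50, df$index, 20) and df1$index from the question, written out so the snippet runs on its own.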

Unsplitting a data frame by a variable (different length of factors)

I have a data frame (st1) that I split by a factor. I then applied functions (e.g. mean) to the split data by another factor, so I can no longer use unsplit because the result has a different length from my original data frame.
To walk you through what I did, here is the code:
NT = data.table(st1)
NT2=split (NT, NT$bin)
NT3 <- data.frame(sapply( NT2 , function(x) x[, list(ang=length(unique(thetadeg)), len=length(T), Vm=mean(V)), by=c("A")]))
Here is the head of st1:
structure(list(A = c(25L, 25L, 25L, 25L, 25L, 25L), T = 56:61,
X = c(481.07, 487.04, 490.03, 499, 504.97, 507.96), Y = c(256.97,
256.97, 256.97, 256.97, 256.97, 256.97), V = c(4.482, 5.976,
7.47, 4.482, 5.976, 7.47), thetarad = c(0.164031585831919,
0.169139558949956, 0.171661200692621, 0.179083242584008,
0.183907246800473, 0.186289411097781), thetadeg = c(9.39831757286096,
9.69098287432395, 9.83546230358968, 10.2607139792383, 10.537109061132,
10.6735970214433), bin = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("binA", "binB", "binC", "outbin"), class = "factor")), .Names = c("A", "T", "X", "Y", "V", "thetarad",
"thetadeg", "bin"), row.names = c(NA, 6L), class = "data.frame")
I did not include a dput(head) of NT3 because it would be too long.
I tried unsplit and unlist, but without success. What I want is a single data frame again, with bin as a factor.
Any help would be great.
Edit: I would like my data frame to have A, ang, len, Vm, and bin as headers.
It's not altogether clear what your intended output is, but looking at what you have for NT3, this may be more effective:
library(data.table)
NT <- data.table(st1, key="A")
NT[, list(ang=length(unique(thetadeg)),
          len=length(T),
          Vm=mean(V)),
   by=list(A, bin)]
I managed to find what I did wrong, so this now works:
NT <- data.table(st1, key="bin")
NT2=NT[, list(ang=length(unique(thetadeg)), len=length(T), Vm=mean(V)), by=c("A", "bin")]
Apparently I could already do this in data.table with the by statement, as @Ricardo Saporta also suggested. Thank you for that!
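For anyone who does end up with a list of split pieces and needs to stack them back together, here is a base R sketch of one way to do it. obs is a made-up miniature of st1 and only the Vm summary is computed, so treat it as an illustration of the split, summarise and recombine pattern rather than a replacement for the data.table code above.

```r
# Hypothetical miniature of st1
obs <- data.frame(A   = c(25L, 25L, 30L, 25L),
                  V   = c(4, 6, 8, 10),
                  bin = c("binA", "binA", "binA", "binB"))

pieces <- split(obs, obs$bin)

# Summarise each piece by A, then tag it with the bin it came from
summarised <- Map(function(piece, bin_name) {
  res <- aggregate(V ~ A, data = piece, FUN = mean)
  names(res)[names(res) == "V"] <- "Vm"
  res$bin <- bin_name
  res
}, pieces, names(pieces))

# Stack the pieces back into a single data frame
recombined <- do.call(rbind, summarised)
rownames(recombined) <- NULL
recombined
```

Adding the bin column before stacking is what makes the pieces recombinable even though they no longer have the original number of rows.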
