Unsplitting a data frame by a variable (different length of factors) - r

I have a data frame (st1) that I split by a factor. I then applied functions (e.g. mean) to the split data, grouped by another factor, so I can no longer use unsplit because the result has a different length from my original data frame.
To walk you through what I did, here is the code:
NT = data.table(st1)
NT2=split (NT, NT$bin)
NT3 <- data.frame(sapply( NT2 , function(x) x[, list(ang=length(unique(thetadeg)), len=length(T), Vm=mean(V)), by=c("A")]))
head of the st1:
structure(list(A = c(25L, 25L, 25L, 25L, 25L, 25L), T = 56:61,
X = c(481.07, 487.04, 490.03, 499, 504.97, 507.96), Y = c(256.97,
256.97, 256.97, 256.97, 256.97, 256.97), V = c(4.482, 5.976,
7.47, 4.482, 5.976, 7.47), thetarad = c(0.164031585831919,
0.169139558949956, 0.171661200692621, 0.179083242584008,
0.183907246800473, 0.186289411097781), thetadeg = c(9.39831757286096,
9.69098287432395, 9.83546230358968, 10.2607139792383, 10.537109061132,
10.6735970214433), bin = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("binA", "binB", "binC", "outbin"), class = "factor")), .Names = c("A", "T", "X", "Y", "V", "thetarad",
"thetadeg", "bin"), row.names = c(NA, 6L), class = "data.frame")
I did not put a dput(head) for my NT3 because it will be too long.
I tried unsplit and unlist, without success. What I want is a single data frame again, with bin as a factor.
Any help would be great.
edit: What I would like my data frame to have is A, ang, len, Vm, and bin as headers.

It's not altogether clear what your intended output is, but looking at what you have for NT3, this may be more effective:
NT <- data.table(st1, key="A")
NT[, list(ang=length(unique(thetadeg))
, len=length(T)
, Vm=mean(V))
, by=list(A, bin) ]

I managed to find what I did wrong, so this now works:
NT <- data.table(st1, key="bin")
NT2=NT[, list(ang=length(unique(thetadeg)), len=length(T), Vm=mean(V)), by=c("A", "bin")]
Apparently the by argument in data.table can already group by several columns at once, as @Ricardo Saporta also suggested. Thank you for that!
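For comparison, here is a minimal sketch of both routes on made-up data: grouping by both columns in one call, and splitting first and then recombining with the bin label restored. The toy values and the names res, pieces and res2 are assumptions for illustration only; the sketch uses data.table's uniqueN(), .N, split(..., by=) and rbindlist(..., idcol=).

```r
library(data.table)

# Toy stand-in for st1 (values are invented for illustration)
st1 <- data.frame(
  A        = c(25L, 25L, 25L, 26L, 26L, 26L),
  V        = c(4, 6, 8, 1, 3, 5),
  thetadeg = c(9.4, 9.7, 9.7, 10.3, 10.5, 10.7),
  bin      = factor(c("binA", "binA", "binB", "binA", "binB", "binB"))
)
NT <- as.data.table(st1)

# Route 1: one grouped call, no split needed
res <- NT[, .(ang = uniqueN(thetadeg), len = .N, Vm = mean(V)), by = .(A, bin)]

# Route 2: split by bin, summarise each piece by A, then stack the pieces.
# idcol = "bin" turns the list names back into a bin column.
pieces <- lapply(split(NT, by = "bin"), function(x)
  x[, .(ang = uniqueN(thetadeg), len = .N, Vm = mean(V)), by = A])
res2 <- rbindlist(pieces, idcol = "bin")

print(res)
print(res2)
```

Both results have the columns A, ang, len, Vm and bin; the only difference is that bin comes back as a character column in res2.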

Related

R Index error while trying to append multiple dataframes into one

So I have a massive dataframe and I'm trying to combine scores I calculated from multiple dataframes (about 17 dataframes) to this one dataframe and I need to do this process 12 different times. This is an example dataframe that I have
df=structure(list(ï..id = structure(c(2L, 7L, 5L, 4L, 3L, 1L, 6L,
8L), .Label = c("B12", "B7", "C2", "C9", "D3", "E2", "E6", "R4"
), class = "factor"), age = c(42L, 45L, 83L, 59L, 49L, 46L, 52L,
23L)), class = "data.frame", row.names = c(NA, -8L))
So I need to calculate network metrics using the igraph package. Here are 2 matrices I have with different people in them
net_mat1=structure(c("B7", "E6", "D3", "C9"), .Dim = c(2L, 2L), .Dimnames = list(
NULL, c("ï..target", "partner")))
net_mat2=structure(c("C2", "B12", "E2", "R4"), .Dim = c(2L, 2L), .Dimnames = list(
NULL, c("ï..target", "partner")))
Here is what I'm calculating
library(igraph)
g1=graph_from_edgelist(net_mat1)
g2=graph_from_edgelist(net_mat2)
degree_cent_close_1=centr_degree(g1, mode = "all")
degree_cent_close_1 #create object that contains metrics
degree.cent_close2=centr_degree(g2, mode = "all")
degree.cent_close2 #create another object that contains metrics
I then create dataframes that contain the metrics I calculated
cent_score_df1=data.frame(degree_cent_close_1$res, V(g1)$name)
cent_score_df1
cent_score_df2=data.frame(degree.cent_close2$res, V(g2)$name)
cent_score_df2
I then try to match and index the values of these metrics back into the df dataframe like this
df$centrality_scores <- cent_score_df1[ match(df[['id']], cent_score_df1[['V.g1..name']] ) , 'degree_cent_close_1.res']
df$centrality_scores
df$centrality_scores <- cent_score_df2[ match(df[['id']], cent_score_df2[['V.g2..name']] ) , 'degree.cent_close2.res']
df$centrality_scores
However, it seems each time I try to merge my data with the original dataframe it can only attach half the data. I can never attach both dataframes. Does anyone have a better method that works for re-attaching data? If there are faster and cleaner ways of doing this I would greatly appreciate the input
The problem with this line of code is that you are not selecting which rows of the original data.frame to update. It overwrites the whole column, so each new assignment replaces the scores written by the previous one with NA.
df$centrality_scores <- cent_score_df1[ match(df[['id']], cent_score_df1[['V.g1..name']] ) , 'degree_cent_close_1.res']
What you intended was to do this:
df$centrality_scores<-NA
rows <- match(cent_score_df1$V.g1..name, df$id) # rows of df that appear in the first network
df$centrality_scores[rows]<- cent_score_df1$degree_cent_close_1.res
Another way to solve this is to standardize the column names of your metric data frames and then use the merge function to add the results back to your original data frame.
names(cent_score_df1)<-c("centrality_scores", "id")
names(cent_score_df2)<-c("centrality_scores", "id")
cent_score<-rbind(cent_score_df1, cent_score_df2)
merge(df, cent_score, by.x="id", by.y="id")
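As a self-contained sketch of the same idea without igraph, the following uses invented score values and assumes the id column is simply named id. The names scores1, scores2, all_scores and merged are hypothetical. It stacks the per-network score tables and attaches them in one step, either with merge() or with match().

```r
# Original data frame (ids and ages from the question, column assumed to be "id")
df <- data.frame(
  id  = c("B7", "E6", "D3", "C9", "C2", "B12", "E2", "R4"),
  age = c(42L, 45L, 83L, 59L, 49L, 46L, 52L, 23L),
  stringsAsFactors = FALSE
)

# Two per-network score tables with standardized column names
# (the score values here are made up for illustration)
scores1 <- data.frame(centrality_scores = c(1, 1), id = c("B7", "D3"))
scores2 <- data.frame(centrality_scores = c(2, 2), id = c("C2", "E2"))
all_scores <- rbind(scores1, scores2)

# Option 1: left join, so ids without a score are kept with NA
merged <- merge(df, all_scores, by = "id", all.x = TRUE)

# Option 2: look up each id once in the stacked table (keeps df's row order)
df$centrality_scores <- all_scores$centrality_scores[match(df$id, all_scores$id)]

print(merged)
print(df)
```

Stacking first means the score column is written once, so no assignment overwrites an earlier one. Note that merge() sorts the result by id, while the match() version keeps the original row order.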

remove double quotes from factors in a dataframe

I got a dataframe to work on where I have a bunch of variables as factors in quotation marks like ""x1"".
str(df) gives me something like this:
$ x : Factor w/ 10 Levels "\"\"x1\"\"",..: 1 7 9 ...
I tried to get rid of the quotation marks with the gsub() function but that didn't work. Probably because I don't know what to insert as pattern? Would be great if somebody can solve this puzzle and maybe explain to me if the "\"\"x1\"\"" is the solution to this?
An example for the dataframe would look like this:
structure(list(Sent = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("\"\"Opted out\"\"",
"\"\"Yes\"\""), class = "factor"), Responded = structure(c(2L,
2L, 2L, 2L, 2L), .Label = c("\"\"Complete\"\"", "\"\"No\"\"",
"\"\"Partial\"\""), class = "factor")), row.names = c(NA, -5L
), class = c("tbl_df", "tbl", "data.frame"), .Names = c("Sent",
"Responded"))
Thanks in advance!
vec = c('""x1""', '""x2""', '""x3""')
vec = factor(vec)
levels(vec) <- gsub('["\\]', "", levels(vec))
#> vec
#[1] x1 x2 x3
#Levels: x1 x2 x3
Note how I use ' as the string delimiter when I want to use " inside the string.
Another reason it probably didn't work for you is that you applied gsub() to the factor variable itself rather than to its levels attribute.
Factor variables are stored internally as integer codes (1, 2, 3, ...), with the labels kept in the levels attribute.
Now that you have provided data, you can use the following (df1 being your data with the factor columns):
df1[] <- lapply(df1, function(vec){ levels(vec) <- gsub('["\\]',"",levels(vec)); vec})
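A small runnable sketch of the same approach on a toy data frame follows. The data is a cut-down stand-in for the example in the question. Restricting the replacement to factor columns is an addition here, meant as a safeguard in case the real data also has non-factor columns.

```r
# Toy data with doubled quotation marks embedded in the factor levels
df1 <- data.frame(
  Sent      = factor(c('""Yes""', '""Yes""', '""Opted out""')),
  Responded = factor(c('""No""', '""Complete""', '""Partial""'))
)

# Only touch factor columns; other column types are left as they are
is_fac <- vapply(df1, is.factor, logical(1))

df1[is_fac] <- lapply(df1[is_fac], function(f) {
  # fixed = TRUE treats the pattern as a literal character, not a regex
  levels(f) <- gsub('"', "", levels(f), fixed = TRUE)
  f
})

str(df1)
```

Because only the levels are rewritten, the underlying integer codes and the row order stay untouched.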

Merging three factors so their dependent variable sums in R

Not sure if someone has answered this - I have searched, but so far nothing has worked for me. I have a very large dataset that I am trying to narrow. I need to combine three factors in my "PROG" variable ("Grad.2","Grad.3","Grad.H") so that they become a single variable ("Grad") where the dependent variable ("NUMBER") of each comparable set of values is summed.
ie.
YEAR = "92/93" AGE = "20-24" PROG = "Grad.2" NUMBER = "50"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.3" NUMBER = "25"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.H" NUMBER = "2"
turns into
YEAR = "92/93" AGE = "20-24" PROG = "Grad" NUMBER = "77"
I want to then drop all other factors for PROG so that I can compare the enrollment rates for Grad without worrying about the other factors (which I deal with separately). So my active independent variables are YEAR and AGE, while the dependent variable is NUMBER.
I hope this shows my data adequately:
structure(list
(YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("92/93", "93/94", "94/95", "95/96", "96/97",
"97/98", "98/99", "99/00", "00/01", "01/02", "02/03", "03/04",
"04/05", "05/06", "06/07", "07/08", "08/09", "09/10", "10/11",
"11/12", "12/13", "13/14", "14/15", "15/16"), class = "factor"),
AGE = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L), .Label = c("1-19",
"20-24", "25-30", "31-34", "35-39", "40+", "NR", "T.Age"), class = c("ordered",
"factor")),
PROG = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
19L, 19L, 19L), .Label = c("T.Prog", "Basic", "Career", "Grad.H",
"Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res",
"NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred",
"Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual",
"Und.Grad", "Und.Grad.Qual"), class = "factor"),
NUMBER = c(104997L,
347235L, 112644L, 38838L, 35949L, 50598L, 5484L, 104991L,
333807L, 76692L)), row.names = c(7936L, 7948L, 7960L, 7972L,
7984L, 7996L, 8008L, 10459L, 10471L, 10483L), class = "data.frame")
In terms of why I am using factors, I don't know how else I should enter the data. Factors made sense, and they were how R interpreted the raw data when I uploaded it.
I am working on the suggestions below. Not had success yet, but I am still learning how to get R to do what I want, and frequently mess up. Will respond to each of you as soon as I have a reasonable answer to give. (And once I stop banging my poor head on my desk... sigh)
If I understand your question correctly, this should do it.
I am assuming your data frame is named df:
library(tidyverse)
df %>%
mutate(PROG = ifelse(PROG %in% c("Grad2", "Grad3","Grad.H"),
"Grad",
NA)) %>% ##combines the 3 Grad variables into one
filter(!is.na(PROG)) %>% ##drops the other variables
group_by(YEAR, AGE) %>%
summarise(NUMBER = sum(NUMBER))
Slightly different approach: only take factors you want, drop the factor variable (because you want to treat them as a group) and sum up all NUMBER values while grouping by all other variables. df is your data.
aggregate(NUMBER ~ .,
data = subset(df, PROG %in% c("Grad2", "Grad3", "Grad.H"), select = -PROG),
FUN = sum)
There are multiple ways to do this, but I agree with FScott that you are likely looking for the levels() function to rename the factor levels. Here is how I would do the second step of summing.
library(magrittr)
library(dplyr)
#do the renaming of the PROG variables here
#sum by PROG
df <- df %>%
group_by(PROG) %>% # you could add more variable names here to group by i.e. group_by(PROG, AGE, YEAR)
mutate(group.sum= sum(NUMBER))
This chunk adds a new column to df named group.sum, holding the sum of NUMBER within each group defined by group_by().
If you want to condense the data frame further, so that the individual values in NUMBER are replaced with group.sum, there are again many ways to do it, but here is a simple one.
#condense df down
df$NUMBER <- df$group.sum
df <- df[,-ncol(df)]
df <- unique(df)
A side note: I wouldn't recommend running the chunk above, because you lose information from your data, and your data is tidier with just the extra column group.sum.
I think the levels() function is what you are looking for. From the manual:
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
I named your data temp and ran this code. It works for me.
z<-gl(n=length(temp$PROG),k=2,labels=c("T.Prog", "Basic", "Career", "Grad.H",
"Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res",
"NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred",
"Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual",
"Und.Grad", "Und.Grad.Qual"))
z
levels(z)<-c(rep("Other",3),rep("Grad",5),rep("Other",12))
z
temp$PROG2<-factor(x=temp$PROG,levels=levels(temp$PROG),labels=z)
temp
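The level-merging idea can also be sketched more directly on a small made-up data frame: assigning the same label to several levels collapses them, after which the rows can be summed. The toy values and the names grad_levels, grad and out are assumptions for illustration.

```r
# Toy data in the shape of the question (values invented)
df <- data.frame(
  YEAR   = factor(rep("92/93", 5)),
  AGE    = factor(c("20-24", "20-24", "20-24", "25-30", "20-24")),
  PROG   = factor(c("Grad2", "Grad3", "Grad.H", "Grad2", "Basic")),
  NUMBER = c(50L, 25L, 2L, 10L, 999L)
)

# Give the three graduate levels one shared label; duplicate labels are merged
grad_levels <- c("Grad2", "Grad3", "Grad.H")
levels(df$PROG)[levels(df$PROG) %in% grad_levels] <- "Grad"

# Keep only the merged level and drop the now-unused ones
grad <- droplevels(df[df$PROG == "Grad", ])

# Sum NUMBER within each YEAR/AGE combination
out <- aggregate(NUMBER ~ YEAR + AGE + PROG, data = grad, FUN = sum)
print(out)
```

With these toy values, the three 20-24 graduate rows collapse into a single row with NUMBER 77, matching the example in the question.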

Can I use %in% to search and match two columns?

I have a large dataframe and a vector that I use to pull out terms of interest. For a previous project I was using:
a=data[data$rn %in% y, "Gene"]
to pull out information into a new vector. Now I have another job I'd like to do.
I have a large dataframe of 15 columns and >100000 rows. I want to search columns 3 and 9 for the content in the vector and print the matching rows as a new dataframe.
To make this extra annoying, the hit could be in v3 and not in v9, and vice versa.
Working example
I have stripped the dataframe down to 3 columns and a few rows.
data <- structure(list(Gene = structure(c(1L, 5L, 3L, 2L, 4L), .Label = c("ibp","leuA", "pLeuDn_02", "repA", "repA1"), class = "factor"), LocusTag = structure(c(1L,2L, 5L, 3L, 4L), .Label = c("pBPS1_01", "pBPS1_02", "pleuBTgp4","pleuBTgp5", "pLeuDn_02"), class = "factor"), hit = structure(c(2L,4L, 3L, 1L, 5L), .Label = c("2-isopropylmalate synthase", "Ibp protein","ORF1", "repA1 protein", "replication-associated protein"), class = "factor")), .Names = c("Gene","LocusTag", "hit"), row.names = c(NA, 5L), class = "data.frame")
y <- c("ibp", "orf1")
First of all, R is case sensitive, so your example will not pick up the third row, which I guess you want extracted. You would have to change your y to
y <- c("ibp", "ORF1")
I am not sure this is exactly what you want, but R has the operator | for an element-wise "or", so you could try something like:
new.data<-data[data$Gene %in% y|data$hit %in% y,]
If you only want to extract certain columns of your data set, you can specify them after the ",", e.g.:
new.data<-data[data$Gene %in% y|data$hit %in% y, c("LocusTag","Gene")]
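If changing y by hand is not practical, one possible variation is to compare everything in lower case. The sketch below rebuilds the example data as plain character columns and uses tolower() on both sides. The name keep is hypothetical, and case-insensitive matching is an assumption about what is wanted here.

```r
# Example data from the question, rebuilt as character columns
data <- data.frame(
  Gene     = c("ibp", "repA1", "pLeuDn_02", "leuA", "repA"),
  LocusTag = c("pBPS1_01", "pBPS1_02", "pLeuDn_02", "pleuBTgp4", "pleuBTgp5"),
  hit      = c("Ibp protein", "repA1 protein", "ORF1",
               "2-isopropylmalate synthase", "replication-associated protein"),
  stringsAsFactors = FALSE
)
y <- c("ibp", "orf1")

# A row is kept if either column matches any term, ignoring case
keep <- tolower(data$Gene) %in% tolower(y) | tolower(data$hit) %in% tolower(y)
new.data <- data[keep, ]
print(new.data)
```

This keeps rows 1 and 3: the first matches on Gene, the third on hit. %in% still tests whole values, so "ibp" does not match "Ibp protein".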

R: Select value from a different column for each row

I have a large data frame (cut down to first 5 rows here) comprised of radio-telemetry readings from multiple antennas. Normally there are 10,000+ rows of data like this every couple of weeks.
structure(list(freq.id = c(13, 13, 13, 13, 13), DT = structure(c(1393835337,
1393921137, 1393879437, 1393881387, 1393920987), class = c("POSIXct",
"POSIXt"), tzone = "America/Bogota"), S1 = c(-13624L, -12866L,
-13291L, -13415L, -13002L), N1 = c(-13969L, -13824L, -13868L,
-13881L, -13911L), S2 = c(-14114L, -14026L, -13957L, -13969L,
-14052L), N2 = c(-14211L, -14238L, -14168L, -14148L, -14211L),
S3 = c(-13245L, -13113L, -12801L, -12860L, -13133L), N3 = c(-13816L,
-13832L, -13878L, -14001L, -13706L), S4 = c(-13479L, -12702L,
-12388L, -12501L, -12692L), N4 = c(-13872L, -13820L, -13992L,
-13905L, -13798L), S5 = c(-12516L, -11485L, -10871L, -10900L,
-11452L), N5 = c(-13884L, -13995L, -13804L, -13840L, -13929L
), S6 = c(-12661L, -12168L, -10982L, -11112L, -12164L), N6 = c(-13911L,
-13914L, -13078L, -13778L, -13911L), PW = c(20L, 20L, 20L,
20L, 21L), PI = c(1078L, 1078L, 1080L, 2156L, 1078L), aru.unk = c(2072L,
2058L, 2014L, 2052L, 2047L), msrfreq = c(164421600L, 164421700L,
164421400L, 164421300L, 164421800L), TOWERID = structure(c(1L,
1L, 1L, 1L, 1L), .Label = c("TOWER4", "TOWER5", "TOWER6",
"TOWER7"), class = "factor"), prog.freq = structure(c(9L,
9L, 9L, 9L, 9L), .Label = c("162.7920", "162.9774", "163.0780",
"163.6804", "163.8600", "164.0309", "164.2930", "164.3950",
"164.4220", "164.4350", "164.5040", "164.5430", "164.5620",
"164.7026", "164.7840", "164.8230", "164.8430", "164.9338",
"165.5000"), class = "factor")), .Names = c("freq.id", "DT",
"S1", "N1", "S2", "N2", "S3", "N3", "S4", "N4", "S5", "N5", "S6",
"N6", "PW", "PI", "aru.unk", "msrfreq", "TOWERID", "prog.freq"
), row.names = 40615:40619, class = "data.frame")
Columns S1,S2...S6 are signal values from different antennas and N1,N2...N6 are corresponding noise values
I am trying to pull out the largest and second-largest signal values for each row and their corresponding noise values. I can get the signal values, as well as each one's "index" within just the signal columns.
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
mydata$strongest<-apply(mydata[,c(3,5,7,9,11,13)],1,function(x) x[maxn(1)(x)])
#columns 3,5,7,9,11,13 are the subset of columns containing signal values
mydata$secondstrongest<-apply(mydata[,c(3,5,7,9,11,13)],1,function(x) x[maxn(2)(x)])
mydata$strongestantenna<-apply(mydata[,c(3,5,7,9,11,13)],1,maxn(1))
# returns 5 because in the first 5 rows, the strongest signal is the 5th antenna (S5)
mydata$secondstrongestantenna<-apply(mydata[,c(3,5,7,9,11,13)],1,maxn(2))
#returns a 6
I'm stuck trying to create 2 new columns that extract the noise values for the antennas that have the 1st and 2nd strongest signals. I was hoping to use the place index (1-6) for each antenna to pull out the correct noise values like this, but it isn't working. It pulls the correct value, but repeats it the same number of times as the value of mydata$strongestantenna
mydata$strongantennanoise<-mydata[c(4,6,8,10,12,14)][mydata$strongestantenna]
#Columns 4,6,8,10,12,14 are the noise values
The strongest and second strongest antennas don't change here, but do in the data, as the animal being tracked moves around.
I feel like I'm overlooking something simple, but I can't figure it out. I appreciate whatever help you can give.
# Get names of the strongest and second strongest antennas by row:
strongest <- apply(mydata[,c(3,5,7,9,11,13)],1, function(x) names(x[maxn(1)(x)]))
secondstrongest <- apply(mydata[,c(3,5,7,9,11,13)],1, function(x) names(x[maxn(2)(x)]))
# Get column index for associated noise columns
biggest.noise.col <- sapply(seq_along(mydata[,1]),
function(x) which(colnames(mydata) == strongest[x]) +1)
second.biggest.noise.col <- sapply(seq_along(mydata[,1]),
function(x) which(colnames(mydata) == secondstrongest[x]) +1)
# Use the indices to extract relevant noise values:
mydata$strongestantennanoise <- sapply(seq_along(mydata[,1]),
function(x) mydata[x, biggest.noise.col[x]])
mydata$secondstrongestantennanoise <- sapply(seq_along(mydata[,1]),
function(x) mydata[x, second.biggest.noise.col[x]])
Maybe you can also try:
dat1 <- dat[,grep("S", colnames(dat))]
Strongest <- do.call(`pmax`, dat1)
Strongest
#[1] -12516 -11485 -10871 -10900 -11452
indx1 <-which(dat1==Strongest,arr.ind=TRUE)
# note: dropping indx11 assumes the strongest antenna is the same column in every row
indx11 <- unique(indx1[,2])
SecondStrongest <- do.call(`pmax`, dat1[,-indx11])
SecondStrongest
#[1] -12661 -12168 -10982 -11112 -12164
indx2 <- which(SecondStrongest ==dat1,arr.ind=TRUE)
dat2 <- dat[,grep("N", colnames(dat))]
MatchingNoise <- dat2[indx1]
MatchingSecondNoise <- dat2[indx2]
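The per-row lookup that the question was reaching for can also be sketched with matrix indexing, where a two-column matrix of (row, column) pairs picks one cell per row. The toy data below has three antennas instead of six, and the names sig, noi, ord and rows are hypothetical. The sketch assumes the signal and noise columns are listed in the same antenna order.

```r
# Toy data: signal columns S1..S3 and matching noise columns N1..N3
mydata <- data.frame(
  S1 = c(-10, -30, -20), N1 = c(-101, -102, -103),
  S2 = c(-20, -10, -30), N2 = c(-201, -202, -203),
  S3 = c(-30, -20, -10), N3 = c(-301, -302, -303)
)

sig <- as.matrix(mydata[, c("S1", "S2", "S3")])
noi <- as.matrix(mydata[, c("N1", "N2", "N3")])  # same antenna order as sig

# For each row, antenna positions sorted from strongest to weakest signal.
# apply() returns one column per row of sig, hence the transpose.
ord  <- t(apply(sig, 1, order, decreasing = TRUE))
rows <- seq_len(nrow(sig))

# cbind(row, col) selects exactly one cell per row
mydata$strongest                   <- sig[cbind(rows, ord[, 1])]
mydata$strongestantennanoise       <- noi[cbind(rows, ord[, 1])]
mydata$secondstrongest             <- sig[cbind(rows, ord[, 2])]
mydata$secondstrongestantennanoise <- noi[cbind(rows, ord[, 2])]

print(mydata)
```

Unlike indexing with a plain vector of column numbers, this works when the strongest antenna changes from row to row, as it does in the toy data.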
