I have an R data frame df with the following contents:
time value reference
45 10 11
22 12 10
13 15 5
I would like to rename the last column of the data frame to obtain:
time value space
45 10 11
22 12 10
13 15 5
I tried this:
colnames(length(colnames(df)))<-"space"
but it does not work. How can I do it?
You can use names() instead:
names(df)[length(names(df))]<-"space"
Marginally less typing:
names(df)[ncol(df)] <- "space"
The following should do what you need:
colnames(df)[ncol(df)] <- "space"
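For reference, here is a minimal sketch of the fix on the example data. The way df is constructed below is an assumption, since the question only shows its printed contents. The attempt in the question applies colnames() to the result of length(...), which is a single number, instead of indexing into the vector of names.

```r
# Rebuild the example data frame from the question (assumed construction).
df <- data.frame(time = c(45, 22, 13),
                 value = c(10, 12, 15),
                 reference = c(11, 10, 5))

# Index into the vector of names and replace only its last element.
names(df)[ncol(df)] <- "space"

names(df)
```

Only the name changes; the values in the last column are left untouched.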
I am building an asymmetrical matrix of values with the rows being coefficient names and the column the value of each coefficient:
# Set up Row and Column Names.
rows = c("Intercept", "actsBreaks0", "actsBreaks1","actsBreaks2","actsBreaks3","actsBreaks4","actsBreaks5","actsBreaks6",
"actsBreaks7","actsBreaks8","actsBreaks9","tBreaks0","tBreaks1","tBreaks2","tBreaks3", "unitBreaks0", "unitBreaks1",
"unitBreaks2","unitBreaks3", "covgBreaks0","covgBreaks1","covgBreaks2","covgBreaks3","covgBreaks4","covgBreaks5",
"covgBreaks6","yearBreaks2016","yearBreaks2015","yearBreaks2014","yearBreaks2013","yearBreaks2011",
"yearBreaks2010","yearBreaks2009","yearBreaks2008","yearBreaks2007","yearBreaks2006","yearBreaks2005",
"yearBreaks2004","yearBreaks2003","yearBreaks2002","yearBreaks2001","yearBreaks2000","yearBreaks1999",
"yearBreaks1998","plugBump0","plugBump1","plugBump2","plugBump3")
cols = c("Value")
# Build Matrix
matrix1 <- matrix(c(1:48), nrow = 48, ncol = 1, byrow = TRUE, dimnames = list(rows,cols))
output:
> matrix1
Value
Intercept 1
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
actsBreaks6 8
actsBreaks7 9
actsBreaks8 10
actsBreaks9 11
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
unitBreaks0 16
unitBreaks1 17
unitBreaks2 18
unitBreaks3 19
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
covgBreaks6 26
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
yearBreaks2009 33
yearBreaks2008 34
yearBreaks2007 35
yearBreaks2006 36
yearBreaks2005 37
yearBreaks2004 38
yearBreaks2003 39
yearBreaks2002 40
yearBreaks2001 41
yearBreaks2000 42
yearBreaks1999 43
yearBreaks1998 44
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48
and I wish to extract certain rows that share row names (i.e. all rows with "unitBreaks'x'") into a submatrix.
I tried this
est_actsBreaks <- est_coef_mtrx[c("actsBreaks0","actsBreaks1","actsBreaks2","actsBreaks3",
"actsBreaks4","actsBreaks5","actsBreaks6","actsBreaks7",
"actsBreaks8","actsBreaks9"),c("Value")]
but it returns a vector and I need a matrix. I have seen other questions concerning similar procedures but their columns and rows all had identical names and/or values. Is there a way to do the operation I have in mind, such as grep()?
Welcome to StackOverflow.
As usual in R, there would probably be many ways to do what you request.
EDIT: I realized that my solution was going a little bit too far, sorry about that.
To extract only the rows whose names contain the pattern "unitBreaks" followed by digits, and still keep a matrix structure, you can run the following code. In a nutshell, grep looks for the pattern and returns the indices of the matching row names, and the argument drop = FALSE makes sure that you get a matrix as a result and not a vector.
uniBreakLines <- grep("unitBreaks[0-9]*", rows)
matrix1[uniBreakLines, , drop = FALSE]
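To see what drop = FALSE changes, here is a small sketch on a stand-in matrix. The matrix m and its values are hypothetical, and the anchored pattern "^unitBreaks[0-9]+$" is a slightly stricter variant of the one above, not something the question requires.

```r
# Small stand-in for matrix1 (hypothetical values).
m <- matrix(1:5, ncol = 1,
            dimnames = list(c("Intercept", "unitBreaks0", "unitBreaks1",
                              "tBreaks0", "unitBreaks2"), "Value"))

# Logical index: TRUE for row names made of "unitBreaks" plus digits.
keep <- grepl("^unitBreaks[0-9]+$", rownames(m))

as_vector <- m[keep, ]                # dimensions dropped: a named vector
as_matrix <- m[keep, , drop = FALSE]  # dimensions kept: a 3 x 1 matrix

is.matrix(as_vector)
dim(as_matrix)
```

With a single column, plain indexing collapses the result to a vector, which is what happened in the question.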
Below is the first version of my answer.
First, I create a vector that describes the groups of rows. For this, I remove the numbers at the end of the row names.
grp <- gsub("[0-9]+$", "", rows)
Then, I transform the data matrix into a data-frame (why I do that is explained a little bit later).
dat1 <- data.frame(matrix1)
Finally, I use "split" on the data-frame, with the groups defined earlier. Using split on the data-frame will keep the structure: the result will be a list of data-frames, even though there is only one column.
dat1.split <- split(dat1, grp)
The result is a list of data-frames.
lapply(dat1.split, head)
$actsBreaks
Value
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
$covgBreaks
Value
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
$Intercept
Value
Intercept 1
$plugBump
Value
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48
$tBreaks
Value
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
$unitBreaks
Value
unitBreaks0 16
unitBreaks1 17
unitBreaks2 18
unitBreaks3 19
$yearBreaks
Value
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
After that, if you still need matrices, you can convert them with the function as.matrix in an "lapply":
matrix1.split <- lapply(dat1.split, as.matrix)
You might want to consider combining your data in a "tibble" with the "grouping" column. You will then be able to use these groups with the group_by function or other functions from the dplyr package (or other packages from the tidyverse).
For example:
library(dplyr)
tib1 <- tibble(rows, simpler_rows = grp, value = 1:48)  # grp is the grouping vector built earlier
And an example on how to use the grouping variable:
tib1 %>%
group_by(simpler_rows) %>%
summarize(sum(value))
EDIT bis: what if I don't know the pattern?
I played around a little bit with your example to answer the question (that nobody asked, but still, it's fun!): "what if I don't know the pattern?"
In this case, I would use a distance between the row names. This distance can be visualized as a heatmap of the pairwise string distances (figure not shown), produced by the following lines of code:
library(stringdist)
library(pheatmap)
strdist <- stringdistmatrix(rows)
pheatmap(strdist, border_color = "white", cluster_rows = FALSE, cluster_cols = FALSE, cellwidth = 10, cellheight = 10, labels_row = rows, fontsize_row = 7)
After that, I only need the number of clusters, which can be read off a silhouette plot (not shown). It suggests that there are 8 clusters of words, which seems about right.
The clusters can then be extracted with the functions used to create the silhouette plot (I used hclust and cutree).
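The answer does not show the clustering code, so the following is only a sketch of how hclust and cutree could be combined. It makes several assumptions: it uses base R's adist (Levenshtein distance) instead of stringdist, average linkage, a shortened list of row names, and k = 5 chosen by hand for that shortened list.

```r
# A shortened set of row names from the question.
rows_small <- c("Intercept", "actsBreaks0", "actsBreaks1", "tBreaks0",
                "tBreaks1", "unitBreaks0", "unitBreaks1", "plugBump0",
                "plugBump1")

# Pairwise Levenshtein distances between the names (utils::adist).
d <- as.dist(adist(rows_small))

# Hierarchical clustering, then cut the tree into k groups.
hc <- hclust(d, method = "average")
cluster <- cutree(hc, k = 5)

# Row names grouped by cluster label.
split(rows_small, cluster)
```

On the full set of 48 names, k would be taken from the silhouette analysis rather than fixed by hand.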
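Here is a solution with dplyr and stringr that filters rows by whether their row names contain a certain string. As written, the ! drops the rows matching "unitBreaks"; remove it to keep only those rows instead.
At the end, convert back to a matrix:
library(dplyr)
library(stringr)
df <- as.data.frame(matrix1)  # start from the matrix in the question
df1 <- df %>%
  filter(!str_detect(rownames(df), "unitBreaks"))
df1 <- as.matrix(df1)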
Value
Intercept 1
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
actsBreaks6 8
actsBreaks7 9
actsBreaks8 10
actsBreaks9 11
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
covgBreaks6 26
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
yearBreaks2009 33
yearBreaks2008 34
yearBreaks2007 35
yearBreaks2006 36
yearBreaks2005 37
yearBreaks2004 38
yearBreaks2003 39
yearBreaks2002 40
yearBreaks2001 41
yearBreaks2000 42
yearBreaks1999 43
yearBreaks1998 44
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48
I've got a mess of data and am trying to efficiently wrangle it into shape. Here's a simplified short sample of the general format of my data.frame right now. The main difference is that I have a few more data labels like Label1 for my sampling units - each has a set of data similar to the data.frame I'm including but in my situation they are all in the same data.frame. I don't think that will complicate the reformatting so I've just included the single sampling unit of mock data here. StatsType levels Ave, Max, and Min are effectively nested within MeasureType.
tastycheez<-data.frame(
Day=rep((1:3),9),
StatsType=rep(c(rep("Ave",3),rep("Max",3),rep("Min",3)),3),
MeasureType=rep(c("Temp","H2O","Tastiness"),each=9),
Data_values=1:27,
Label1=rep("SamplingU1",27))
Ultimately, I would like a data frame where for each sampling unit and each Day there are columns holding the Data_values for my categories, like this:
Day Label1 Ave.Temp Ave.H2O Ave.Tastiness Max.Temp ...
1 SamplingU1 1 10 19 4 ...
2 SamplingU1 2 11 20 5 ...
I think some combination of functions from reshape, dplyr, tidyr, and/or data.table could do the job, but I can't figure out how to code it. Here's what I've tried:
First, I spread the tastycheez (yum!), and that got me partway:
test<-spread(tastycheez,StatsType,Data_values)
Now I'm trying to spread it again or to cast, but with no luck:
test2<-spread(test,MeasureType,(Ave,Max,Min))
test2 <- recast(Day ~ MeasureType+c(Ave,Max,Min), data=test)
(I also tried melting the tastycheez but the results were a sticky, gooey mess and my tongue got burnt. That doesn't seem to be the right function for this.)
If you hate my puns please excuse them, I really can't figure this out!
Here are a couple related questions:
Combining two subgroups of data in the same dataframe
How can I spread repeated measures of multiple variables into wide format?
reshape2
You could use dcast from reshape2:
library(reshape2)
dcast(tastycheez,
Day + Label1 ~ paste(StatsType, MeasureType, sep="."),
value.var = "Data_values")
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
tidyr
Stealing @DavidArenburg's comment, here's the tidyr way:
library(tidyr)
tastycheez %>%
unite(temp, StatsType, MeasureType, sep = ".") %>%
spread(temp, Data_values)
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
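In tidyr 1.0.0 and later, pivot_wider can do the unite and spread steps in one call. This is a sketch of that variant, not part of the original answers; it assumes a tidyr version that provides pivot_wider and rebuilds tastycheez as defined in the question.

```r
library(tidyr)

# Mock data from the question.
tastycheez <- data.frame(
  Day = rep(1:3, 9),
  StatsType = rep(c(rep("Ave", 3), rep("Max", 3), rep("Min", 3)), 3),
  MeasureType = rep(c("Temp", "H2O", "Tastiness"), each = 9),
  Data_values = 1:27,
  Label1 = rep("SamplingU1", 27))

# One column per StatsType/MeasureType pair, named like "Ave.Temp".
# Day and Label1 are kept as identifying columns.
wide <- pivot_wider(tastycheez,
                    names_from = c(StatsType, MeasureType),
                    values_from = Data_values,
                    names_sep = ".")
wide
```

The new columns appear in order of first appearance in the data rather than alphabetically, so their order differs from the dcast and spread output.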
I'm at a loss to cycle through a data frame and calculate a variable that is a function of different/multiple rows. Please see the following data as an example.
date var1 var2 var3
12/29/2013 10 34 0
12/30/2013 10 34 15
12/31/2013 8 27 15
1/1/2014 8 27 0
1/2/2014 2 7 10
1/3/2014 10 35 20
1/4/2014 13 45 10
I would like to create a variable that is a function of the current row and the next row. For example,
var4(12/31/2013) = var1(12/31/2013) + var2(1/1/2014) + var3(12/31/2013)
For the last element in the dataframe, there is no (n+1) variable, so I'd like to assign a missing value/exception value in that case. Any guidance you could provide would be wonderful. Thank you in advance!
You could try:
library(dplyr)
df %>%
mutate(var4=var1+lead(var2)+var3)
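The same shift can be sketched in base R on the example data. The data frame construction below is an assumption (dates are kept as plain strings), and next_var2 is a helper name introduced here for illustration.

```r
# Example data from the question (assumed construction).
df <- data.frame(
  date = c("12/29/2013", "12/30/2013", "12/31/2013", "1/1/2014",
           "1/2/2014", "1/3/2014", "1/4/2014"),
  var1 = c(10, 10, 8, 8, 2, 10, 13),
  var2 = c(34, 34, 27, 27, 7, 35, 45),
  var3 = c(0, 15, 15, 0, 10, 20, 10))

# Shift var2 up by one row; the last row has no successor, so it gets NA.
next_var2 <- c(df$var2[-1], NA)

df$var4 <- df$var1 + next_var2 + df$var3
df$var4
# 44 52 50 15 47 75 NA
```

For 12/31/2013 this gives 8 + 27 + 15 = 50, matching the formula in the question, and the last row is NA because there is no following row.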
I am relatively new to R from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, stop value, and 100+ columns of numerical values. The goal is to get the sum of each row from the column that corresponds to the start value to the column that corresponds to the stop value. This is straightforward to do in a loop, which looks like this (the data frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
df$out[i] <- rowSums(df[i,df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
If you are dealing with values of all the same types, you typically want to do things in matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=TRUE)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=TRUE))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row; the function
# drops the first two values (start and end) and sums the
# remaining values from position `start` to position `end`
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run the apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that you are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column) in one go, which you can then cbind back to your data frame (and that means you modify your data frame just once).
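Another fully vectorized option is to build a logical mask of the cells that fall inside each row's range and pass the masked matrix to rowSums. This is a sketch under assumptions: vals, start_col and stop_col are hypothetical names, and the start/stop values are taken to index into the value columns only.

```r
# Toy data: 5 rows, 10 value columns.
vals <- matrix(1:50, nrow = 5, ncol = 10)
start_col <- 1:5
stop_col <- 2:6

# col(vals)[i, j] is j. start_col and stop_col recycle down the columns,
# so element [i, j] is compared against start_col[i] and stop_col[i].
in_range <- col(vals) >= start_col & col(vals) <= stop_col

# Cells outside the range are multiplied by FALSE (0) and drop out.
out <- rowSums(vals * in_range)
out
# 7 19 31 43 55
```

Unlike the diag(A %*% B) approach, this never builds an nrow by nrow matrix. Note that NA * FALSE is still NA, so missing values outside the range would need to be handled separately.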
I have a data frame where columns are constantly being added to it. I also have a total column that I would like to stay at the end. I think I must have skipped over some really basic command somewhere but cannot seem to find the answer anywhere. Anyway, here is some sample data:
x=1:10
y=21:30
z=data.frame(x,y)
z$total=z$x+z$y
z$w=11:20
z$total=z$x+z$y+z$w
When I type z I get this:
x y total w
1 1 21 33 11
2 2 22 36 12
3 3 23 39 13
4 4 24 42 14
5 5 25 45 15
6 6 26 48 16
7 7 27 51 17
8 8 28 54 18
9 9 29 57 19
10 10 30 60 20
Note how the total column comes before the w, and obviously any subsequent columns. Is there a way I can force it to be the last column? I am guessing that I would have to use ncol(z) somehow. Or maybe not.
You can reorder your columns as follows:
z <- z[,c('x','y','w','total')]
To do this programmatically, after you're done adding your columns, you can retrieve their names like so:
nms <- colnames(z)
Then you can grab the ones that aren't 'total' like so:
nms[nms!='total']
Combined with the above:
z <- z[, c(nms[nms!='total'],'total')]
You have a logic issue here. Whenever you add to a data.frame, it grows to the right.
Easiest fix: keep total a vector until you are done, and only then append it. It will then be the rightmost column.
(For critical applications, you would of course determine your width k beforehand, allocate k+1 columns and just index the last one for totals.)
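A minimal sketch of the "append the total last" idea, using the sample data from the question. The use of rowSums over an explicit list of columns is an assumption about how the total is meant to be computed.

```r
z <- data.frame(x = 1:10, y = 21:30)
z$w <- 11:20  # keep adding columns as needed

# Compute the total once, after the last column has been added,
# so that it ends up as the rightmost column.
z$total <- rowSums(z[c("x", "y", "w")])

names(z)
```

If total already exists in the data frame, assigning to it again updates the values but keeps its position, which is why the reordering shown above is needed in that case.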