I have a data frame and want to extract the values of a single row, across all columns, as a numeric vector. One way of doing that would be df_transposed = t(df), and then I can just take the wanted column with df_transposed$column.
I feel there is a better way of doing it, without creating a new data frame and taking more memory. I tried something like t(df)$column, but obviously this doesn't work.
How can this be done?
Try this
as.numeric(df['rowname',])
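As a quick check of the line above, here is a minimal sketch on a small all-numeric data frame. The row names "a" to "d" are made up for illustration, and the approach assumes every column is numeric.

```r
# Small all-numeric data frame; the row names are hypothetical
df <- data.frame(matrix(1:12, 4, 3))
rownames(df) <- c("a", "b", "c", "d")

# df["c", ] is still a one-row data frame; as.numeric() flattens it
# into a plain, unnamed numeric vector
row_vec <- as.numeric(df["c", ])
row_vec
# [1]  3  7 11
```

No transposed copy of the whole data frame is built here; only the single row is extracted.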
Data frames are a special type of list, consisting of vectors of equal length. So we can treat them as lists and extract the nth element of each vector, where n is the row number in your data frame. Example:
df
# X1 X2 X3
# 1 1 5 9
# 2 2 6 10
# 3 3 7 11
# 4 4 8 12
sapply(df, `[`, 3)
# X1 X2 X3
# 3 7 11
You can wrap unname(.) around it to drop the element names, but this probably creates another copy in memory and is purely cosmetic.
Data:
df <- data.frame(matrix(1:12, 4, 3))
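For comparison, a small sketch of the named and unnamed results side by side. The vapply variant is my own addition, not part of the answer; it assumes the columns are integer, as they are for this example data.

```r
df <- data.frame(matrix(1:12, 4, 3))

# sapply keeps the column names on the result
named <- sapply(df, `[`, 3)

# unname() strips them, leaving a bare integer vector
plain <- unname(named)

# Stricter variant (an assumption about your data): vapply fails loudly
# if any column does not yield exactly one integer
strict <- vapply(df, `[`, integer(1), 3)

names(named)
# [1] "X1" "X2" "X3"
plain
# [1]  3  7 11
```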
Say I have a data.frame:
df <- data.frame(A=c(10,20,30),B=c(11,22,33), C=c(111,222,333))
A B C
1 10 11 111
2 20 22 222
3 30 33 333
If I select two (or more) columns I get a data.frame:
x <- df[,1:2]
A B
1 10 11
2 20 22
3 30 33
This is what I want. However, if I select only one column I get a numeric vector:
x <- df[,1]
[1] 10 20 30
I have tried to use as.data.frame(), which does not change the results for two or more columns. It does return a data.frame in the case of one column, but does not retain the column name:
x <- as.data.frame(df[,1])
df[, 1]
1 10
2 20
3 30
I don't understand why it behaves like this. In my mind it should not make a difference whether I extract one, two, or ten columns. It should either always return a vector (or matrix) or always return a data.frame (with the correct names). What am I missing? Thanks!
Note: This is not a duplicate of the question about matrices, as matrix and data.frame are fundamentally different data types in R, and can work differently with dplyr. There are several answers that work with data.frame but not matrix.
Use drop=FALSE
> x <- df[,1, drop=FALSE]
> x
A
1 10
2 20
3 30
From the documentation (see ?"[") you can find:
If drop=TRUE the result is coerced to the lowest possible dimension.
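A short sketch of the difference, using the data frame from the question:

```r
df <- data.frame(A = c(10, 20, 30), B = c(11, 22, 33), C = c(111, 222, 333))

# Default (drop = TRUE): a single column collapses to a plain vector
v <- df[, 1]

# drop = FALSE: the result stays a one-column data frame, name included
d <- df[, 1, drop = FALSE]

is.data.frame(v)
# [1] FALSE
is.data.frame(d)
# [1] TRUE
names(d)
# [1] "A"
```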
Omit the ,:
x <- df[1]
A
1 10
2 20
3 30
From the help page of ?"[":
Indexing by [ is similar to atomic vectors and selects a list of the specified element(s).
A data frame is a list. The columns are its elements.
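To make the list view concrete, here is a small sketch contrasting single-bracket indexing (which returns a sub-list, i.e. a data frame) with double-bracket indexing (which returns the element itself):

```r
df <- data.frame(A = c(10, 20, 30), B = c(11, 22, 33), C = c(111, 222, 333))

by_pos  <- df[1]      # one-column data frame, selected by position
by_name <- df["A"]    # the same column, selected by name
element <- df[["A"]]  # the column itself, as a numeric vector

is.data.frame(by_pos)
# [1] TRUE
names(by_name)
# [1] "A"
element
# [1] 10 20 30
```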
You can also use subset:
subset(df, select = 1) # by index
subset(df, select = A) # by name
As mentioned in the comments you can also use dplyr::select, but you do not need to quote the variable name:
library(dplyr)
# by name
df %>%
select(A)
# by index
df %>%
select(1)
Given the below dataframe
df <- data.frame(cbind(seq(1:4),rep(letters[seq(1:3)],4)))
X1 X2
1 a
2 b
3 c
4 a
1 b
2 c
3 a
4 b
1 c
2 a
3 b
4 c
I would like to summarize unique X2s by X1. For example,
1 a,b,c
2 b,c,a
3 c,a,b
4 a,b,c
I am very close. I use the following code:
summary <- aggregate(df$X2, list(df$X1), FUN=unique)
which produces
Group.1 X
1 1,2,3
2 2,3,1
3 3,1,2
4 1,2,3
(the index of the list). What is the most efficient way to get my desired result?
I am certain there is an easy solution and I've tried searching, but I must not be using the correct search terms. Thank you in advance.
We can use toString to paste the elements
aggregate(X2~X1, unique(df), toString )
Or if we need to keep it as list
aggregate(X2~X1, transform(unique(df), X2 = as.character(X2)), list)
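To see what the toString call returns, here is a sketch on a cleanly built version of the question's data. The construction with rep and stringsAsFactors = FALSE is my assumption about the intended data, not the OP's original code.

```r
# Same values as in the question, built without cbind
df <- data.frame(X1 = rep(1:4, 3),
                 X2 = rep(letters[1:3], 4),
                 stringsAsFactors = FALSE)

# unique(df) drops duplicated (X1, X2) pairs; toString collapses each
# group's X2 values into a single comma-separated string
res <- aggregate(X2 ~ X1, unique(df), toString)

res$X2
# [1] "a, b, c" "b, c, a" "c, a, b" "a, b, c"
```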
Since the OP also asked for an efficient approach, here is a data.table option:
library(data.table)
unique(setDT(df))[, .(X2 = toString(X2)), by = X1]
Regarding the creation of the data.frame: it is easier, more compact, and less error-prone to avoid wrapping cbind inside data.frame. The main reason is that cbind produces a matrix, and a matrix can hold only a single class. So if there is a single character column or element, all the elements are converted to character. When that matrix is then converted with as.data.frame, stringsAsFactors=TRUE is the default (in R versions before 4.0.0), so the columns are converted to factor class.
df <- data.frame(X1= 1:4, X2= rep(letters[1:3],4), stringsAsFactors= FALSE)
The above code gets the intended output. Note that seq is not needed when we use :
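A minimal sketch of the coercion described above, on a shortened four-row version of the data:

```r
# cbind builds a matrix, so the integers are coerced to character
m <- cbind(1:4, letters[1:4])
typeof(m)
# [1] "character"

# Building the data frame directly keeps each column's own type
good <- data.frame(X1 = 1:4, X2 = letters[1:4], stringsAsFactors = FALSE)
sapply(good, class)
#          X1          X2
#   "integer" "character"
```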
I have the following data frame:
df <- data.frame(a=rep(1:3),b=rep(1:3),c=rep(4:6),d=rep(4:6))
df
a b c d
1 1 1 4 4
2 2 2 5 5
3 3 3 6 6
I would like to have a value N that determines my window size, so for this example I will set
N <- 1
I would like to split this dataframe into equal portions of N rows and store the 3 resulting dataframes into a list.
I have the following code:
groupMaker <- function(x, y) 0:(x-1) %/% y
testlist2 <- split(df, groupMaker(nrow(df), N))
The problem is that this code renames my columns by adding an X0. prefix:
result <- as.data.frame(testlist2[1])
result
X0.a X0.b X0.c X0.d
1 1 1 4 4
I would like code that does exactly the same thing but keeps the column names as they are. Please keep in mind that my original data has a lot more than 3 rows, so I need something that is applicable to a much larger data frame.
To extract a list element, we can use [[. Also, as each list element is already a data.frame, we don't need to call as.data.frame explicitly.
testlist2[[1]]
We can also use gl to create the grouping variable.
split(df, as.numeric(gl(nrow(df), N, nrow(df))))
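Here is a short sketch, reusing the question's code, of why [ and [[ behave differently on the split result:

```r
df <- data.frame(a = 1:3, b = 1:3, c = 4:6, d = 4:6)
N <- 1
groupMaker <- function(x, y) 0:(x - 1) %/% y
testlist2 <- split(df, groupMaker(nrow(df), N))

# Single bracket: still a list (of length one) whose element is named "0";
# as.data.frame() on it prepends that name to every column
class(testlist2[1])
# [1] "list"

# Double bracket: the data frame itself, with its original column names
class(testlist2[[1]])
# [1] "data.frame"
names(testlist2[[1]])
# [1] "a" "b" "c" "d"
```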
I have got the following problem. I have a data.frame with an x and y column representing some points in space:
X<-c(18.25743,18.25783,18.25823,18.25850,18.25863,18.25878,
18.25885,18.25912,18.25943,18.25962,18.25978,18.26000,
18.26022,18.26051,18.26070,18.26095,18.26118,18.26140,
18.26189,18.26250,18.26310,18.26390)
Y<-c(44.69561,44.69564,44.69567,44.69567,44.69586,
44.69600,44.69637,44.69671,44.69691,44.69701,44.69720,
44.69740,44.69763,44.69774,44.69787,44.69790,44.69791,
44.69795,44.69812,44.69802,44.69812,44.69834)
eDF<-data.frame(X,Y)
Now my problem is that they are "sorted" wrongly for plotting. So what I need is a function that puts together the rows of the two points which belong together (in a list of lists):
1 and 12 is ID1
2 and 13 is ID2
3 and 14 is ID3
...
11 and 22 is ID11
Each list created this way within the list of lists should have its own unique ID (simply numbered from 1 to the end), because I have this problem in all my data sets, which have different lengths.
It would be great if the starting point of the second run of rows (the 12) were flexible, always taking the first row after half of the data: (rownumber/2)+1, which is 12 in this example.
I have tried some things and I think I'm on the right track, but I can't figure out a solution by myself.
This function is pretty close, but I can't manage to make it start at different rows (1 and 12):
lapply(2:nrow(eDF), function(x) eDF[(x-1):x,])
I also tried to figure it out with seq, and it would do what I need if I could make a list of lists by connecting both code samples. I also need to replace the concrete start and end numbers with a dynamic solution.
eDF[(seq(1,to=11,by=1)),] # selecting rows 1 to 11
eDF[(seq(12,to=nrow(eDF),by=1)),] #selecting rows 12 to end
Anyone any ideas?
I don't know if you needed an ID column inside of the new list but another way would be:
#create the IDs
eDF$ID <- rep(1:11,2)
#split the data.frame according to those
mylist <- split(eDF, eDF$ID)
Output:
mylist
$`1`
X Y ID
1 18.25743 44.69561 1
12 18.26000 44.69740 1
$`2`
X Y ID
2 18.25783 44.69564 2
13 18.26022 44.69763 2
$`3`
X Y ID
3 18.25823 44.69567 3
14 18.26051 44.69774 3
$`4`
X Y ID
4 18.2585 44.69567 4
15 18.2607 44.69787 4
#and so on...
You could just do split(eDF, rep(1:11, 2)) if you don't need the ID column.
We can modify the OP's lapply code
lapply(1:11, function(i) eDF[c(i, i+11),])
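Since the question asks for a flexible starting point, here is a sketch that derives the offset from the number of rows instead of hard-coding 11. The small eDF below and the "ID" list names are hypothetical stand-ins, and the code assumes an even number of rows.

```r
# Hypothetical stand-in for the real coordinates: 6 rows, so half = 3
eDF <- data.frame(X = 1:6, Y = 11:16)

# Row i is paired with row i + half
half <- nrow(eDF) %/% 2
pairs <- lapply(seq_len(half), function(i) eDF[c(i, i + half), ])
names(pairs) <- paste0("ID", seq_len(half))

rownames(pairs$ID1)
# [1] "1" "4"
```

With the 22-row data from the question, half evaluates to 11, which reproduces the pairing of rows 1 and 12, 2 and 13, and so on.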
I have a data frame with a sequence of numeric columns, surrounded on both sides by (irrelevant) columns of characters. I want to obtain a new data frame that keeps the position of the irrelevant columns, and adds the numeric columns to each other according to a certain grouping vector (or applies some other row-wise function to the data frame, by group). Example:
sample = data.frame(cha1 = c("A","B"),num1=1:2,num2=3:4,num3=11:12,num4=13:14,cha2=c("C","D"))
> sample
cha1 num1 num2 num3 num4 cha2
1 A 1 3 11 13 C
2 B 2 4 12 14 D
with the goal to obtain
> goal
cha1 X1 X2 cha2
1 A 4 24 C
2 B 6 26 D
i.e. I've summed the 4 numeric columns according to the grouping vector gl(2,2,4) = (1,1,2,2) [levels: 1,2]
For a purely numeric data frame I've found the following method:
sample_num = sample[,2:5] #select numeric columns
data.frame(t(apply(sample_num,1,function(row) tapply(row, INDEX=gl(2,2,4),sum))))
I could combine this with re-inserting the character columns to give the intended result, but I'm really looking for a more elegant way. I'm particularly interested in a plyr method if there is one, as I'm trying to migrate to plyr for all my data frame manipulations. I imagine the first step would be to cast the data frame into long format, but I have no idea how to proceed from there.
One 'absolute' requirement is that I cannot do without the gl(n,k,l) method of grouping, as I need this to be applicable to a wide range of data frames and grouping factors.
EDIT: for simplicity assume that I know which columns are the relevant numeric columns. I'm not concerned with how to select them, I'm concerned with how to do my grouped sum without messing up the original data frame structure.
Thanks!
Grpindex<-gl(2,2,4)
goal<-cbind.data.frame(sample["cha1"],(t(rowsum(t(sample[,2:5]), paste0("X",Grpindex)))),sample["cha2"])
Output:
cha1 X1 X2 cha2
1 A 4 24 C
2 B 6 26 D
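An alternative sketch of the same grouped sum, which splits the column indices by the gl() factor instead of transposing. This is not the method from the answer above; the names sample_df, num_cols and sums are my own, and it assumes the data frame has at least two rows (with a single row, sapply would return a vector rather than a matrix).

```r
# Renamed to sample_df to avoid shadowing base::sample
sample_df <- data.frame(cha1 = c("A", "B"), num1 = 1:2, num2 = 3:4,
                        num3 = 11:12, num4 = 13:14, cha2 = c("C", "D"))

num_cols <- 2:5                         # positions of the numeric columns
grp <- gl(2, 2, length(num_cols))       # grouping factor: 1 1 2 2

# For each group of column positions, sum those columns row by row
sums <- sapply(split(num_cols, grp), function(idx) rowSums(sample_df[idx]))
colnames(sums) <- paste0("X", levels(grp))

# Put the character columns back in their original positions
goal <- cbind(sample_df["cha1"], sums, sample_df["cha2"])
goal$X1
# [1] 4 6
goal$X2
# [1] 24 26
```

Swapping rowSums for another row-wise function should generalise this to other per-group summaries.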