How to rbind data from differing lengths of tables - r

Ok, I am sure there is a simple solution to this. Assuming the following data
A <- 1:10
B <- rep("Part A", 10)
C <- 10:19
df1 <- data.frame(A,B,C)
A <- 1:9
B <- rep("Part B", 9)
D <- 20:28
df2 <- data.frame(A,B,D)
Now I want to create df3 from columns whose names the user specifies.
So df3 should be a 19 x 2 data frame containing only A and B.
This does not work:
df3 <- rbind(df1[A,B], df2[A,B])
I don't want to use common_cols or the [,x] function, as my real dataset has over 1000 variables that are not always in the same order.

Your syntax isn't quite right for subsetting. Try
cols <- c("A", "B")
rbind(df1[,cols], df2[,cols])
The columns you want to keep should be given as a vector of names (or indices or logicals) after the comma.
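As a minimal sketch of the same idea, the snippet below rebuilds the two example data frames and stacks only the requested columns. The `drop = FALSE` argument and the `lapply`/`do.call` variant for more than two tables are additions for illustration, not part of the original answer.

```r
# Rebuild the example data from the question
df1 <- data.frame(A = 1:10, B = rep("Part A", 10), C = 10:19)
df2 <- data.frame(A = 1:9,  B = rep("Part B", 9),  D = 20:28)

# Columns to keep, selected by name so their position does not matter
cols <- c("A", "B")

# Stack the two subsets; drop = FALSE keeps a data frame even for one column
df3 <- rbind(df1[, cols, drop = FALSE], df2[, cols, drop = FALSE])

# Same idea for any number of tables held in a list (illustrative variant)
tables <- list(df1, df2)
df3_all <- do.call(rbind, lapply(tables, function(d) d[, cols, drop = FALSE]))

dim(df3)  # 19 rows, 2 columns
```

Because the columns are picked by name, it does not matter that df1 and df2 hold them in different positions or carry extra columns such as C and D.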

Related

Select columns based on another column in a different data frame in R

I have a df:
AA <- c("GA","GA", "GA","GA","GA")
A <- c(1,2,3,4,5)
B <- c(5,4,3,2,1)
C <- c(2,3,4,5,1)
D <- c(4,3,2,1,5)
df <- data.frame(AA, A, B, C, D)
The other df is:
E <- c("B", "D")
F <- c("GA","GA")
df2 <- data.frame(E, F)
I would like to only select the columns from df based on the values from df2$E.
And that data frame would look like this:
AA <- c("GA","GA", "GA","GA","GA")
B <- c(5,4,3,2,1)
D <- c(4,3,2,1,5)
df3 <- data.frame(AA, B, D)
My current code below gives me an empty data frame with 0 obs. and 5 variables:
df3 <- df %>% filter(df %in% df2$E)
Any assistance in writing code that works would be greatly appreciated.
Thank you!
Here we can index via column names.
df[,c("AA",df2$E)]
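A hedged sketch of that indexing on the example data follows. The `as.character()` call is an added safeguard: in R versions before 4.0, `data.frame()` converted strings to factors by default, and combining a factor with `c()` could yield its integer codes instead of its labels. The name `keep` is only illustrative.

```r
# Example data from the question
df <- data.frame(AA = rep("GA", 5),
                 A = c(1, 2, 3, 4, 5),
                 B = c(5, 4, 3, 2, 1),
                 C = c(2, 3, 4, 5, 1),
                 D = c(4, 3, 2, 1, 5))
df2 <- data.frame(E = c("B", "D"), F = c("GA", "GA"))

# Build the vector of column names; as.character() guards against factors
keep <- c("AA", as.character(df2$E))

# Select columns by name; filter() would select rows, not columns
df3 <- df[, keep, drop = FALSE]
names(df3)
```

The original attempt returned no rows because `filter()` works on rows, whereas the task here is to choose columns.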

Preventing variable renaming while creating new dataframe from another data frame

I have a very simple dataset like this,
a <- c(29, 10, 29)
b <- c(32, 23, 43)
c <- c(33,22,1)
df1 <- data.frame(a, b, c)
I want to create a new data frame from vectors a and c of df1. I am running the following command:
df2 <- data.frame(df1$a, df1$c)
It is creating a data frame with the variable names df1.a and df1.c. Is there any way to keep the variable names exactly as they are in df1?
df2 <- data.frame(a=df1$a, c=df1$c)
   a  c
1 29 33
2 10 22
3 29  1
I assume your a, b, c variables are not directly available anymore
colnames(df2) <- c("a", "c")
should do the trick?
df1[,c("a","c")]
In case you select only one column: df1[,"a",drop=FALSE].
Always include drop=FALSE to handle the general case:
selectedColumns <- c("a","c")
df1[, selectedColumns, drop=FALSE]
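To illustrate why `drop = FALSE` matters, here is a small sketch using the question's data. The names `as_vector` and `as_frame` are made up for this example.

```r
df1 <- data.frame(a = c(29, 10, 29), b = c(32, 23, 43), c = c(33, 22, 1))

# Without drop = FALSE a single column collapses to a plain vector
as_vector <- df1[, "a"]

# With drop = FALSE the result stays a data frame and keeps its name
as_frame <- df1[, "a", drop = FALSE]

is.data.frame(as_vector)  # FALSE
is.data.frame(as_frame)   # TRUE
names(as_frame)           # "a"
```

When the set of selected columns is computed at run time and may have length one, the second form avoids a silent change of type.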
If your real application is more complex than just taking a subset (which seems an obviously good solution), you can use setNames (here it doesn't make much sense, but it could help if you are trying to automatically rename the data frame at construction...):
df2 <- setNames(df1[, c('a', 'c')], names(df1[, c('a', 'c')]) )

reference x's column in R's apply function

I have a df like this:
a <- c(4,5,3,5,1)
b <- c(8,9,7,3,5)
c <- c(6,7,5,4,3)
df <- data.frame(rbind(a,b,c))
I want a new df, df2, containing the difference between the values in each cell in rows a and b and the value in row c in their respective columns.
df2 would look like this:
a <- c(-2,-2,-2,1,-2)
b <- c(2,2,2,-1,2)
df2 <- data.frame(rbind(a,b))
Here is where I'm getting stuck:
df2 <- data.frame(apply(df,c(1,2),function(x) x - df[nrow(df),the col index of x]))
How do I reference the column index of x? Is there something like JavaScript's this?
We can do this easily by replicating the 3rd row to make the lengths equal before subtracting it from the first two rows
out <- df[c("a", "b"),] - df["c",][col(df[c("a", "b"),])]
identical(df2, out)
#[1] TRUE
Or explicitly using rep
df[c("a", "b"),] - rep(unlist(df["c",]), each = 2)
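The `rep(..., each = 2)` form works because R stores and recycles values column by column, so each value of row c has to appear once per remaining row. A sketch of the same subtraction on a plain matrix is shown below, together with `sweep()` as an alternative that was not in the original answers. The names `m`, `by_rep` and `by_sweep` are illustrative.

```r
# Same numbers as in the question, kept as a matrix with named rows
m <- rbind(a = c(4, 5, 3, 5, 1),
           b = c(8, 9, 7, 3, 5),
           c = c(6, 7, 5, 4, 3))

top <- m[c("a", "b"), ]

# Repeat each value of row c twice to line up with column-major storage
by_rep <- top - rep(unname(m["c", ]), each = nrow(top))

# sweep() subtracts one statistic per column (MARGIN = 2); default FUN is "-"
by_sweep <- sweep(top, 2, m["c", ])

by_sweep
```

With `sweep()` there is no need to work out the repetition pattern by hand, which may be easier to read when the number of rows changes.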

R: Looping through list of dataframes in a vector

I have a dataset where I only want to loop through certain columns in a dataframe one at a time to create a graph. The structure of my dataframe consists of data that I parsed from a larger dataset into a vector containing multiple dataframes.
I want to call one column from one dataframe in the vector. I want to loop on the dataframes to call each column.
See example below:
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4))
my.list <- list(d1, d2)
All I have to work with is my.list
How would I do this?
You can use lapply to plot each of the individual data frames in your list. For example,
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6),y3=c(7,8,9))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4),y3=c(11,12,13))
mylist <- list(d1, d2)
par(mfrow=c(2,1))
# lapply on a subset of columns
lapply(mylist, function(x) plot(x$y2, x$y3))
You don't need a for loop to get their data points. You can select the columns by name.
# a toy dataframe
d <- data.frame(A = 1:20, B = sample(c(FALSE, TRUE), 20, replace = TRUE),
                C = LETTERS[1:20], D = rnorm(20, 0, 1))
col_names <- c("A", "B", "D") # names of columns I want to get
d[,col_names] # returns a dataset with the values of the columns you want
Here is a solution to your problem using a for loop:
# a toy dataframe
mylist <- list(dat1 = data.frame(A = 1:20, B = LETTERS[1:20]),
               dat2 = data.frame(A = 21:40, B = LETTERS[1:20]),
               dat3 = data.frame(A = 41:60, B = LETTERS[1:20]))
col_names <- c("A") # name of columns I want to get
for (i in seq_along(mylist)) {
  # you can do whatever you want with what is returned;
  # here I am just printing them out
  print(names(mylist)[i]) # name of the data frame
  print(mylist[[i]][, col_names]) # values in Column A
}
I think the simplest answer to your question is to use double brackets.
for (i in seq_along(my.list)) {
  print(my.list[[i]]$column)
}
That works assuming the data frames in your list share the same column names. You could also refer to the column by its position in the data frame if you wanted.
Yes, lapply can be more elegant, but in some situations a for loop makes more sense.
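For comparison, a minimal sketch of pulling one named column out of every data frame in the question's list with `lapply` is shown here. The names `y1_values` and `y2_means` are invented for the example, and the mean is just a stand-in for whatever is done with each column.

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
my.list <- list(d1, d2)

# One element per data frame, each holding that frame's y1 column
y1_values <- lapply(my.list, function(d) d[["y1"]])

# sapply simplifies to a vector when every result has length one
y2_means <- sapply(my.list, function(d) mean(d[["y2"]]))

y1_values[[2]]
y2_means
```

Replacing the body of the anonymous function with a plotting call gives the per-data-frame graphs the question asks about.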

R: control auto-created column names in call to rbind()

If I do something like this:
> df <- data.frame()
> rbind(df, c("A","B","C"))
  X.A. X.B. X.C.
1    A    B    C
You can see the row gets added to the empty data frame. However, the columns get named automatically based on the content of the data.
This causes problems if I later want to:
> df <- rbind(df, c("P", "D", "Q"))
Is there a way to control the names of the columns that get automatically created by rbind? Or some other way to do what I'm attempting to do here?
@baha-kev has a good answer regarding strings and factors.
I just want to point out the weird behavior of rbind for data.frame:
# This "should work", but it doesn't:
# Create an empty data.frame with the correct names and types
df <- data.frame(A=numeric(), B=character(), C=character(), stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Messes up names!
rbind(df, list(A=42, B='foo', C='bar')) # OK...
# If you have at least one row, names are kept...
df <- data.frame(A=0, B="", C="", stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
But if you only have strings then why not use a matrix instead? Then it works fine to start with an empty matrix:
# Create a 0x3 matrix:
m <- matrix('', 0, 3, dimnames=list(NULL, LETTERS[1:3]))
# Now add a row:
m <- rbind(m, c('foo','bar','baz')) # This works fine!
m
# Then optionally turn it into a data.frame at the end...
as.data.frame(m, stringsAsFactors=FALSE)
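A related sketch, assuming all rows are character vectors of the same length: collect the rows in a list, bind them once, and only then assign the column names. The column names `first`, `second` and `third` are placeholders chosen for illustration.

```r
# Rows gathered as they arrive; both come from the question
rows <- list(c("A", "B", "C"),
             c("P", "D", "Q"))

# Bind all rows into a character matrix in one step
m <- do.call(rbind, rows)

# Convert once at the end and set the names explicitly
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("first", "second", "third")

df
```

Setting the names after the conversion means they never depend on the content of the first row.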
Set the stringsAsFactors argument to FALSE, which stores the values as character strings:
df <- data.frame(first = 'A', second = 'B', third = 'C', stringsAsFactors = FALSE)
df2 <- rbind(df, c('Horse', 'Dog', 'Cat'))
df2
  first second third
1     A      B     C
2 Horse    Dog   Cat
sapply(df2, class)
      first      second       third
"character" "character" "character"
Later, if you want to use factors, you could convert the columns like this:
df2[] <- lapply(df2, factor)
