I have a very simple dataset like this,
a <- c(29, 10, 29)
b <- c(32, 23, 43)
c <- c(33,22,1)
df1 <- data.frame(a, b, c)
I want to create a new data frame from vectors a and c of df1. I am running the following command,
df2 <- data.frame(df1$a, df1$c)
It creates a data frame with the variable names df1.a and df1.c. Is there any way I can keep the variable names exactly as they are in df1?
df2 <- data.frame(a=df1$a, c=df1$c)
a c
1 29 33
2 10 22
3 29 1
I assume your a, b, c variables are not directly available anymore
colnames(df2) <- c("a", "c")
should do the trick?
df1[,c("a","c")]
In case you select only one column: df1[,"a",drop=FALSE].
Always include drop=FALSE to handle the general case:
selectedColumns <- c("a","c")
df1[, selectedColumns, drop=FALSE]
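A minimal sketch of why drop=FALSE matters, using the df1 from the question. The variable names one_col_vec, one_col_df and two_cols are made up for this illustration.

```r
# Rebuild the example data from the question
df1 <- data.frame(a = c(29, 10, 29), b = c(32, 23, 43), c = c(33, 22, 1))

# Without drop = FALSE, a single column collapses to a plain vector
one_col_vec <- df1[, "a"]

# With drop = FALSE, the result stays a data frame and keeps the name "a"
one_col_df <- df1[, "a", drop = FALSE]

# Selecting by name keeps the original column names, unlike data.frame(df1$a, df1$c)
two_cols <- df1[, c("a", "c"), drop = FALSE]
names(two_cols)
```

Because the columns are selected by name, no renaming step is needed afterwards.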
If your real application is more complex than just taking a subset (which seems an obviously good solution), you can use setNames (here it doesn't make much sense, but it could help if you are trying to automatically rename the data frame at construction...):
df2 <- setNames(df1[, c('a', 'c')], names(df1[, c('a', 'c')]))
I am trying to work out how to create a user defined function to perform a calculation on a series of columns in a dataframe, and add the answer as an additional column to the same dataframe. To keep things simple, the test example I have been using is to calculate percentage growth from one year to the next, but the goal is to be able to create more elaborate calculations that are too cumbersome and repetitive to manually calculate.
The practice data I have been using is...
a <- c(10, 12)
b <- c(11, 9)
df <- t(data.frame(a, b))
df <- data.frame(df)
colnames(df) <- c(2001, 2002)
Which will look like...
2001 2002
a 10 12
b 11 9
The manual calculation I have been using is...
df$PercGrowth <- (df$`2002` - df$`2001`) / df$`2001` * 100
Which returns:
2001 2002 PercGrowth
a 10 12 20.00000
b 11 9 -18.18182
How do I turn this into a user defined function where I can specify the columns to perform the calculation, and then have the answer added to the dataframe as a derived value?
What I initially thought might work was...
pg <- function(data, c1, c2)
df <- mutate(data, PercGrowth = ((df[c2] -df[c1]) / df[c1] * 100))
pg(df, 1, 2)
However I keep getting the error message:
Error: Column PercGrowth is of unsupported class data.frame
How do I get this to work?
This is actually more complicated than it looks - you need to use dplyr pronouns and quasiquotation in order to pass the column names as arguments in the function. The following code works:
library(dplyr)
a <- c(10, 12)
b <- c(11, 9)
df <- t(data.frame(a, b))
df <- data.frame(df)
colnames(df) <- c("year1", "year2")
pg <- function(df, col1, col2) {
  quo_col1 <- enquo(col1)
  quo_col2 <- enquo(col2)
  df %>%
    mutate(pct_growth = (!! quo_col2 - !! quo_col1) / !! quo_col1 * 100)
}
pg(df, year1, year2)
I renamed the columns to syntactic names (year1, year2) so they are easier to work with than the numeric names 2001 and 2002. You can read more at this link: https://dplyr.tidyverse.org/articles/programming.html
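For what it's worth, newer versions of dplyr also accept the embrace operator, which folds the enquo and !! steps into one. The sketch below assumes dplyr 1.0 or later (rlang 0.4.0 or later); the function name pg_embrace is made up for this example.

```r
library(dplyr)

# Same calculation as pg(), but with {{ }} instead of enquo() + !!
# (pg_embrace is a hypothetical name, not part of dplyr)
pg_embrace <- function(df, col1, col2) {
  df %>%
    mutate(pct_growth = ({{ col2 }} - {{ col1 }}) / {{ col1 }} * 100)
}

df <- data.frame(year1 = c(10, 11), year2 = c(12, 9), row.names = c("a", "b"))
res <- pg_embrace(df, year1, year2)
res
```

The column names are still passed unquoted, exactly as in pg(df, year1, year2).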
Another option could be to use some kind of string matching on the column names you're interested in, perform operations using those columns, and then join the result back to the main data frame.
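If dplyr is not a requirement, a base R version avoids quasiquotation altogether. This is only a sketch: pg_base and its new_col argument are names invented here. It also hints at the cause of the original error: df[c2] returns a one-column data frame, whereas df[[c2]] returns the underlying vector.

```r
# Base R sketch: columns are given as names (strings) or positions
# pg_base and new_col are hypothetical names used for illustration
pg_base <- function(data, c1, c2, new_col = "PercGrowth") {
  # [[ ]] extracts a vector; [ ] would return a data frame
  data[[new_col]] <- (data[[c2]] - data[[c1]]) / data[[c1]] * 100
  data
}

df <- data.frame(year1 = c(10, 11), year2 = c(12, 9), row.names = c("a", "b"))
res <- pg_base(df, "year1", "year2")
res
```

Calling pg_base(df, 1, 2) with column positions should give the same result, since [[ ]] accepts both names and indices.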
Ok, I am sure there is a simple solution to this. Assuming the following data
A <- 1:10
B <- rep("Part A", 10)
C <- 10:19
df1 <- data.frame(A,B,C)
A <- 1:9
B <- rep("Part B", 9)
D <- 20:28
df2 <- data.frame(A,B,D)
Now I want to create df3, where the user specifies which columns to include.
So df3 should be a 19 x 2 data frame containing only A and B.
This does not work:
df3 <- rbind(df1[A,B], df2[A,B])
I don't want to use common_cols or the [,x] function, as my real dataset has over 1000 variables that are not always in the same order.
Your syntax isn't quite right for subsetting. Try
cols <- c("A", "B")
rbind(df1[,cols], df2[,cols])
The columns you want to keep should be a vector of names (or indices/logicals) after the ,.
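As a rough sketch, the same idea can be wrapped in a small helper that also checks that every requested column exists, which may help with 1000+ variables in varying order. The name stack_selected is made up here.

```r
df1 <- data.frame(A = 1:10, B = rep("Part A", 10), C = 10:19)
df2 <- data.frame(A = 1:9,  B = rep("Part B", 9),  D = 20:28)

# Hypothetical helper: keep only `cols` from each data frame, then stack them
stack_selected <- function(cols, ...) {
  frames <- list(...)
  for (f in frames) {
    missing_cols <- setdiff(cols, names(f))
    if (length(missing_cols) > 0) {
      stop("missing columns: ", paste(missing_cols, collapse = ", "))
    }
  }
  # Selecting by name makes the column order in each input irrelevant
  do.call(rbind, lapply(frames, function(f) f[, cols, drop = FALSE]))
}

df3 <- stack_selected(c("A", "B"), df1, df2)
dim(df3)
```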
Let's say I have a data.table with columns A, B and C.
I'd like to write a function that applies a filter (for example A>1), but "A" needs to be dynamic (a parameter of the function): if I pass A, it does A>1; if I pass B, it does B>1, and so on (A and B always being column names, of course).
Example:
Let's say my data is as below. I'd like to do "A==1" and have it return the green line, or do "B==1 & C==1" and have it return the blue line.
Can this be done?
thanks
You can try
f1 <- function(dat, colName){dat[eval(as.name(colName))>1]}
setDT(df1)
f1(df1, 'A')
f1(df1, 'B')
If you need to make the value also dynamic
f2 <- function(dat, colName, value){dat[eval(as.name(colName))>value]}
f2(df1, 'A', 1)
f2(df1, 'A', 5)
data
set.seed(24)
df1 <- data.frame(A=sample(-5:10, 20, replace=TRUE),
B=rnorm(20), C=LETTERS[1:20], stringsAsFactors=FALSE)
Try:
dt = data.table(A=c(1,1,2,3,1), B=c(4,5,1,1,1))
f=function(dt, colName) dt[dt[[colName]]>1,]
#> f(dt, 'A')
# A B
#1: 2 1
#2: 3 1
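For the "B==1 & C==1" case from the question, one possible sketch is a helper that takes a named list of column/value pairs and combines the equality tests with &. The helper name filter_equal and the sample data are invented for illustration (the original example data is not reproduced here), and a plain data.frame is used.

```r
# Hypothetical helper: keep rows where every named column equals its value
filter_equal <- function(dat, conditions) {
  keep <- rep(TRUE, nrow(dat))
  for (col in names(conditions)) {
    # [[ ]] looks the column up by its name, so the name can be dynamic
    keep <- keep & dat[[col]] == conditions[[col]]
  }
  dat[keep, , drop = FALSE]
}

# Made-up sample data
d <- data.frame(A = c(1, 2, 3), B = c(2, 1, 1), C = c(3, 1, 2))

r1 <- filter_equal(d, list(A = 1))         # A == 1
r2 <- filter_equal(d, list(B = 1, C = 1))  # B == 1 & C == 1
```

This sketch does not handle NA values in the filtered columns; those rows would come back as NA rows.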
If your data is
a <- c(1:9)
b <- c(10:18)
# create a data.frame
df <- data.frame(a,b)
# or a data.table
dt <- data.table(a,b)
you can store your condition(s) in a variable x
x <- quote(a >= 3)
and filter the data.frame using dplyr (subsetting with [] won't work)
library(dplyr)
filter(df, !!x) # unquote the stored expression with !!
or using data.table as suggested by @Frank
library(data.table)
dt[eval(x),]
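Passing the stored expression straight to [] on a data.frame fails, but evaluating it inside the data frame first seems to work in base R as well. A small sketch, assuming the same a and b columns as above:

```r
df <- data.frame(a = 1:9, b = 10:18)
x <- quote(a >= 3)

# eval(x, df) evaluates the expression with the columns of df in scope,
# giving a logical vector that [] can use for row selection
res <- df[eval(x, df), , drop = FALSE]
res
```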
Why write a function? You can do this...
Specifically:
d.new=d[d$A>1,]
where d is the data frame, d$A is the variable, and d.new is the new data frame.
More generally:
data=d #data frame
variable=d$A #variable
minValue=1 #minimum value
d.new=data[variable>minValue,] #create new data frame (d.new) filtered by min value
To create a new column:
If you don't want to actually create a new dataframe but want to create an indicator variable you can use ifelse. This is most similar to coloring rows as shown in your example. Code below:
d$indicator1=ifelse(d$X1>0,1,0)
Suppose I have the following data.frames:
library(dplyr)
set.seed(13)
df <- data_frame(A = sample(letters[1:2], 6, rep=TRUE), B = sample(1:3, 6, rep = TRUE))
new_df <- data_frame(A ="a", B = 4)
Suppose I want to update all the rows of df where A == "a" with the value 4 (This is an example, in general df has more than one row). I can do this the following way:
df %>% left_join(new_df %>% rename(b=B)) %>% mutate(B = ifelse(is.na(b), B, b))
Which is fine, but this does not look elegant. Is there a better way to do this?
I came across this issue while cleaning up data. I calculate a certain column from another column; it should be a unique id, but due to data collection issues it is not. I have another table with the correct ids, and I want to update them. Usually the number of incorrect ids is low compared to the number of correct ids, so doing a join seems like overkill.
Well, if you're looking for elegant (and fast), here's how you can replace those values in-place:
library(data.table)
dt = as.data.table(df) # alternatively call setDT to convert in-place
setkey(dt, A)
dt[new_df, B := i.B]
dt
# A B
#1: a 4
#2: a 4
#3: a 4
#4: a 4
#5: b 2
#6: b 2
Two notes. You will get warnings, as data.table is very careful about types and the types of your two tables don't match. Second note - the i. ensures that you use the B column of the i-expression, i.e. the first argument of [.data.table, and is used to resolve conflicts such as here.
It doesn't require dplyr but how about:
df$B <- ifelse (df$A=="a",4,df$B)
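When the lookup table has more than one row, a base R sketch with match() might generalise the ifelse approach. The data below is made up for the example, and idx and hit are names chosen here.

```r
# Made-up data: df holds the values, new_df holds the corrections
df <- data.frame(A = c("a", "b", "a", "c"), B = c(1, 2, 3, 5),
                 stringsAsFactors = FALSE)
new_df <- data.frame(A = c("a", "c"), B = c(4, 9), stringsAsFactors = FALSE)

# For each row of df, find the position of its key in new_df (NA if absent)
idx <- match(df$A, new_df$A)
hit <- !is.na(idx)

# Overwrite only the rows that have a correction
df$B[hit] <- new_df$B[idx[hit]]
df
```

Only the matched rows are touched, so no join over the full table is needed.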
If I do something like this:
> df <- data.frame()
> rbind(df, c("A","B","C"))
X.A. X.B. X.C.
1 A B C
You can see the row gets added to the empty data frame. However, the columns get named automatically based on the content of the data.
This causes problems if I later want to:
> df <- rbind(df, c("P", "D", "Q"))
Is there a way to control the names of the columns that get automatically created by rbind? Or some other way to do what I'm attempting to do here?
@baha-kev has a good answer regarding strings and factors.
I just want to point out the weird behavior of rbind for data.frame:
# This "should work", but it doesn't:
# Create an empty data.frame with the correct names and types
df <- data.frame(A=numeric(), B=character(), C=character(), stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Messes up names!
rbind(df, list(A=42, B='foo', C='bar')) # OK...
# If you have at least one row, names are kept...
df <- data.frame(A=0, B="", C="", stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
But if you only have strings then why not use a matrix instead? Then it works fine to start with an empty matrix:
# Create a 0x3 matrix:
m <- matrix('', 0, 3, dimnames=list(NULL, LETTERS[1:3]))
# Now add a row:
m <- rbind(m, c('foo','bar','baz')) # This works fine!
m
# Then optionally turn it into a data.frame at the end...
as.data.frame(m, stringsAsFactors=FALSE)
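If you would rather stay with a data.frame, one possible sketch is to turn each new row into a one-row data frame with the right names before calling rbind. The helper name append_row is made up here, and it assumes the values are given in column order.

```r
# Hypothetical helper: name the values after df's columns, then rbind
append_row <- function(df, values) {
  new_row <- as.data.frame(as.list(setNames(values, names(df))),
                           stringsAsFactors = FALSE)
  rbind(df, new_row)
}

# Start from an empty data frame with the desired column names
df <- data.frame(A = character(), B = character(), C = character(),
                 stringsAsFactors = FALSE)
df <- append_row(df, c("A", "B", "C"))
df <- append_row(df, c("P", "D", "Q"))
df
```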
Set the option "stringsAsFactors" to FALSE, which stores the values as characters:
df=data.frame(first = 'A', second = 'B', third = 'C', stringsAsFactors=FALSE)
df2 = rbind(df,c('Horse','Dog','Cat'))
df2
first second third
1 A B C
2 Horse Dog Cat
sapply(df2,class)
first second third
"character" "character" "character"
Later, if you want to use factors, you could convert it like this:
df2 = as.data.frame(unclass(df2), stringsAsFactors=TRUE)