Suppose a = matrix(c(1,2,3,4), nrow=2, ncol=2) and b = c('name', 3). I am trying to merge a and b such that the outcome is [1 3 name 3] in the first row and [2 4] in the second row.
The number of rows differs between the two data frames, so cbind is going to have a hard time combining them and will by default recycle the shorter one, in this case b.
I would suggest adding the row name as a column in each data frame and then joining on that. full_join will then generate NA values for the data frame missing that value of the join. This question is partially a duplicate of Add (not merge!) two data frames with unequal rows and columns, so you may find more help there.
# Load packages
library(tidyverse)
library(magrittr) # To use the inplace assignment operator (%<>%)
# Create dataframes
a <- data.frame(1:2,3:4)
b <- data.frame(x = 'name', y = 3)
# Create rowname column for each dataframe
a %<>% tibble::rownames_to_column()
b %<>% tibble::rownames_to_column()
# Use 'full join' to bind dataframes together
c <- dplyr::full_join(a, b, by = "rowname") %>%
  # Remove the rowname column
  dplyr::select(-rowname)
# Print c
print(c)
X1.2 X3.4 x y
1 1 3 name 3
2 2 4 <NA> NA
If you are satisfied with a list rather than a data frame, this will work.
a <- matrix(c(1,2,3,4),nrow=2,ncol=2)
b <- c('name',3)
c <- list(a[,1], a[,2], b[1], b[2])
If you need a data frame,
you have to make the 1st and 2nd row have the same number of columns, by stuffing the gaps with something.
d <- as.data.frame(c)
d[2,3:4] <- NA
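If you prefer predictable column names over the auto-generated ones, you can name the list elements before converting; a small sketch (the names V1-V4 are just placeholders):
# Naming the list elements makes the data frame columns take those names
c <- list(V1 = a[,1], V2 = a[,2], V3 = b[1], V4 = b[2])
d <- as.data.frame(c)
d[2, 3:4] <- NA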
Related
I have a large data frame that I coerced as a tibble to be able to use the dplyr package. I wanted to know if there was a way to "replace" a column in the tibble with the same variable in a different notation.
I have tried the mutate() function but I don't want to add a new column to the tibble, just replace a column with a vector of the same length.
You just need to use the existing column name inside mutate(); assigning to the same name overwrites that column instead of adding a new one. For example, if you want to divide by 100:
mutate(var = var/100)
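For instance, a minimal runnable sketch (the tibble df and the column var are placeholders for your own data):
library(dplyr)
df <- tibble(var = c(100, 250, 375))   # placeholder data
df <- df %>% mutate(var = var / 100)   # overwrites var in place, no new column added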
If I understand your question correctly, I think the answer is mutate()!
> library(dplyr)
> d <- tibble(x=1:3,y=2:4)
> d <- d %>% mutate(x=8:10) ## replace column x
> d
# A tibble: 3 x 2
x y
<int> <int>
1 8 2
2 9 3
3 10 4
I am trying to rename columns, but I do not know whether a given column will be present in the dataset. I have a large data set, and if a certain column name is present I want to rename it. For example:
A B C D E
1 4 5 9 2
3 5 6 9 1
4 4 4 9 1
newNames <- data %>% rename(`1` = A, `2` = B, `3` = C, `4` = D, `5` = E)
This works to rename what is in the dataset but I am looking for the flexibility to add more potential name changes, without an error occurring.
newNames2 <- data %>% rename(`1` = A, `2` = B, `3` = C, `4` = D, `5` = E, `6` = F, `7` = G)
This ^ will not work; it gives me an error because F and G are not in the data set.
Is there any way to write a code to ignore the column change if the name does not exist?
Thanks!
There can be plenty of ways to do this. One would be to create a named vector with the old names as values and their corresponding new names as the vector's names, and use that, i.e.
# The vector v1 below uses LETTERS as the old names (values) and 1:7 as the new ones (names)
v1 <- setNames(LETTERS[1:7], 1:7)
names(df) <- names(v1)[v1 %in% names(df)]
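A quick sketch applying this to the question's example (columns A-E present, F and G absent); note that this assumes the columns of df appear in the same order as in v1:
df <- data.frame(A = c(1,3,4), B = c(4,5,4), C = c(5,6,4), D = c(9,9,9), E = c(2,1,1))
v1 <- setNames(LETTERS[1:7], 1:7)           # old names as values, new names as names
names(df) <- names(v1)[v1 %in% names(df)]   # keep only the new names whose old name exists
names(df)
# [1] "1" "2" "3" "4" "5"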
I have a data.frame that can contain N columns (N defined at runtime), and I want to get the rows that satisfy N-1 conditions; in other words, I want to get only the rows with a specific value in each of the first N-1 columns.
For instance if I have a data frame with four columns (A,B,C,D) and five rows:
A B C D
1 2 3 4
9 9 9 9
1 2 9 5
4 3 2 1
1 2 3 8
I would get all the rows with A==1 & B==2 & C==3, i.e.:
A B C D
1 2 3 4
1 2 3 8
But as said, the data frame can be composed of any number of rows and columns (defined at runtime), and the values of the conditions may change.
I implemented this function (simplified):
getRows <- function(dataFrame, values) {
  conditions <- rep(TRUE, dim(dataFrame)[1])
  for (k in 1:length(values)) {
    conditions <- conditions & (dataFrame[, k] == values[k])
  }
  return(dataFrame[conditions, ])
}
Of course, this assumes the values in the values vector are ordered to match the column order of the data frame, and that the length of the vector is N-1.
The function works, but I have the feeling that it is not really efficient to create the vector of booleans, evaluate boolean expressions in this way, and so on... especially if the data frame contains a lot of data.
Another solution that I found is:
getRows <- function(dataFrame, values) {
  tmp <- dataFrame
  for (k in 1:length(values)) {
    tmp <- tmp[tmp[, k] == values[k], ]
  }
  return(tmp)
}
Basically this 'reduces' the data frame by filtering out all the rows that do not satisfy each condition. But I think this is even worse, because it creates a new data frame object for each condition (ok, always smaller, but anyway...).
So my question is: is there a method to do that more efficiently?
One possibility:
# if you are only checking for equalities
f <- function(df, values){
  # values must be a list with the column names of df as names and the target values as elements
  y <- paste(names(values), unlist(values), sep = "==", collapse = " & ")
  return(df[eval(parse(text = y), envir = df), ])
}
l <- as.vector(1:3, "list")
names(l) <- colnames(df)[-ncol(df)]
f(df, l)
A B C D
1 1 2 3 4
5 1 2 3 8
# you can also use other conditions
f <- function(df, values){
  # values must be a list with the column names of df as names and the conditions (e.g. "<=2") as elements
  y <- paste(names(values), unlist(values), collapse = " & ")
  return(df[eval(parse(text = y), envir = df), ])
}
l <- as.vector(paste0(c("==", "<=", "=="), 1:3), "list")
names(l) <- colnames(df)[-ncol(df)]
f(df, l)
A B C D
1 1 2 3 4
5 1 2 3 8
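For reference, both snippets above assume df is the example data frame from the question, which can be recreated like this:
df <- data.frame(A = c(1, 9, 1, 4, 1),
                 B = c(2, 9, 2, 3, 2),
                 C = c(3, 9, 9, 2, 3),
                 D = c(4, 9, 5, 1, 8))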
Sometimes matrices are quicker than data.frames to operate on, so something along the lines of:
mat <- t(as.matrix(df[-ncol(df)]))
boolMat <- (mat == values) # if necessary, use match to reorder values to match the columns of df
ind <- colSums(boolMat) == nrow(boolMat)
df[ind, ]
The idea being that values will get recycled along the columns of the matrix (which are the rows of the dataframe). colSums is meant to be quicker than an apply, so the final line should be somewhat optimised compared to apply(boolMat, 2, all).
The optimal solution will depend on the size and proportions of the data, whether the entries are all integers, and perhaps what proportion of matches you get in the data. So, as @droopy mentions, you'll need to benchmark. My approach involves creating a copy of the data, so if your data is already approaching memory limits it might struggle; in that case you could generate your data in matrix rather than data.frame format to avoid the duplication.
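A minimal usage sketch with the question's example data, assuming values <- c(1, 2, 3) is already ordered to match the first three columns of df:
values <- c(1, 2, 3)
mat <- t(as.matrix(df[-ncol(df)]))
ind <- colSums(mat == values) == nrow(mat)   # values recycle down each column (one column per row of df)
df[ind, ]
#   A B C D
# 1 1 2 3 4
# 5 1 2 3 8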
I am trying to do a simple task, and have created a simple example. I would like to add the counts of a taxon recorded in one vector ('introduced', below) to the counts already measured in another vector ('existing'), according to the taxon name. However, when a taxon is new (present in 'introduced' but not in 'existing'), I would like that taxon and its count to be added as a new entry (the order doesn't matter, but the name needs to be retained).
For example:
existing<-c(3,4,5,6)
names(existing)<-c("Tax1","Tax2","Tax3","Tax4")
introduced<-c(2,2)
names(introduced)<-c("Tax1","Tax5")
I want the combined result, called "combined" here, to look like this:
#names(combined)= c("Tax1","Tax2","Tax3","Tax4","Tax5")
#combined= c(5,4,5,6,2)
The main thing to see is that "Tax1"'s values are combined (3+2=5) and "Tax5" (2) is added on at the end.
I have looked around but previous answers similar to this have much more complex data and it is difficult to extract which function I need. I have been trying combinations of match and which, but just cannot get it right.
grp <- c(existing,introduced)
tapply(grp,names(grp),sum)
#Tax1 Tax2 Tax3 Tax4 Tax5
# 5 4 5 6 2
Instead of keeping your data in 'loose' vectors, you may consider collecting them in one data frame. First, put your two sets of vector data in data frames:
existing <- c(3, 4, 5, 6)
taxon <- c("Tax1", "Tax2", "Tax3", "Tax4")
df1 <- data.frame(existing, taxon)
introduced <- c(2, 2)
taxon <- c("Tax1", "Tax5")
df2 <- data.frame(introduced, taxon)
Then merge the two data frames by the common column, 'taxon'. Set all = TRUE to include all rows from both data frames:
df3 <- merge(df1, df2, all = TRUE)
Finally, sum 'existing' and 'introduced' for each taxon, and add the result to the data frame:
df3$combined <- rowSums(df3[ , c("existing", "introduced")], na.rm = TRUE)
df3
# taxon existing introduced combined
# 1 Tax1 3 2 5
# 2 Tax2 4 NA 4
# 3 Tax3 5 NA 5
# 4 Tax4 6 NA 6
# 5 Tax5 NA 2 2
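If you would rather end up with a named vector like the original 'existing', you can pull the result back out of the data frame, for example:
combined <- setNames(df3$combined, df3$taxon)
combined
# Tax1 Tax2 Tax3 Tax4 Tax5
#    5    4    5    6    2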
I have a data frame that contains multiple rows and multiple columns.
I have a character vector that contains the names of some of the columns in the data frame. The number of columns can vary.
For each row, I have to identify whether at least one of these columns is not NA (basically any(!is.na(df[namescolumns])) for each row), and then subset the rows for which that is TRUE.
Actually, any(!is.na(df[1,][namescolumns])) works well, but only for the first row.
I could easily do a for loop, which is my first reflex as a programmer, and it would work, but I'm sure it's not the R way and that there is a way to do this with one of the "apply" functions (lapply, mapply, sapply, tapply or another); I just can't figure out which one and how.
Thank you.
Try using apply over the first dimension (rows):
apply(df, 1, function(x) any(!is.na(x[namescolumns])))
Since the function returns a single value per row here, the result is a plain logical vector that you can use directly to subset df; if the function returned more than one value per row, the results would come back transposed and you might want to wrap the whole statement in t(.)
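A small usage sketch with made-up data (df and namescolumns here are just placeholders):
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c("x", "y", "z"))
namescolumns <- c("a", "b")
keep <- apply(df, 1, function(x) any(!is.na(x[namescolumns])))
df[keep, ]   # drops row 2, where both a and b are NA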
You can use a combination of lapply and Reduce
no.na.in.cols <- Reduce(`&`, lapply(colnames, function (name) !is.na(df[name])))
to get a vector that is TRUE for the rows with no NA values in any of the columns in colnames, which can in turn be used to subset the data.
df[no.na.in.cols, ]
For example, given:
df <- data.frame(a = c(1,2,3,4,NA,6,7),
b = c(2,4,6,8,10,12,14),
c = c("one","two","three","four","five","six","seven"),
d = c("a",NA,"c","d","e","f","g")
)
colnames <- c("a","d")
You can get:
> df[Reduce(`&`, lapply(colnames, function (name) !is.na(df[name]))),]
a b c d
1 1 2 one a
3 3 6 three c
4 4 8 four d
6 6 12 six f
7 7 14 seven g
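The version above (using the & operator) keeps only the rows with no NA in any of the listed columns; if you instead want the question's original condition (at least one of the columns is not NA), swap & for |, e.g.:
df[Reduce(`|`, lapply(colnames, function (name) !is.na(df[name]))), ]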