Operation between two data frames of different sizes in R

I'd like to add two data frames of different sizes in R.
> x = data.frame(a=c(1,2,3),b=c(5,6,7))
> y = data.frame(x=c(1,1,1))
> x
a b
1 1 5
2 2 6
3 3 7
> y
x
1 1
2 1
3 1
The result I want is,
>
a b
1 2 6
2 3 7
3 4 8
How can I do this?

It is probably easiest to convert y to a vector with unlist and then perform the operation. Here, the vector returned by unlist(y) is recycled over the columns of the data.frame x.
x + unlist(y)
a b
1 2 6
2 3 7
3 4 8
As a side note, data.frames are a special type of list object, and performing operations on lists can sometimes be a bit more involved. On the other hand, data.frames tend to work fairly well with vectors as long as the dimensions line up (here, as long as the vector has the same length as the number of rows in the data.frame).
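To make the recycling concrete, here is a minimal sketch using the x and y from the question. The second data frame y2 is a hypothetical variant with distinct values, added only to show that the vector is aligned with the rows of x.

```r
x <- data.frame(a = c(1, 2, 3), b = c(5, 6, 7))
y <- data.frame(x = c(1, 1, 1))

# unlist() turns the one-column data.frame into a plain vector
v <- unlist(y)

# The vector has one value per row, so it is recycled down each column of x
res <- x + v

# y2 is a hypothetical variant with distinct values, to show row alignment
y2 <- data.frame(x = c(10, 20, 30))
res2 <- x + unlist(y2)
```

With y2, the first row of x gets 10 added in both columns, the second row 20, and the third row 30.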

We can make the dimensions the same and then take the sum:
x + rep(y, ncol(x))
# a b
#1 2 6
#2 3 7
#3 4 8
Another option is sweep, with MARGIN = 1 so that the values in y$x are applied row by row:
sweep(x, 1, y$x, `+`)
# a b
#1 2 6
#2 3 7
#3 4 8
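Because every value in y is 1 in the question, the result does not reveal which margin was swept. The sketch below uses hypothetical values (row_add and col_add are not from the question) to show how sweep(x, MARGIN, STATS, FUN) behaves for rows and for columns.

```r
x <- data.frame(a = c(1, 2, 3), b = c(5, 6, 7))
row_add <- c(10, 20, 30)   # hypothetical: one value per row
col_add <- c(100, 200)     # hypothetical: one value per column

# MARGIN = 1: STATS supplies one value per row
by_row <- sweep(x, 1, row_add, `+`)

# MARGIN = 2: STATS supplies one value per column
by_col <- sweep(x, 2, col_add, `+`)
```

Here by_row adds 10, 20, 30 to rows 1, 2, 3, while by_col adds 100 to column a and 200 to column b.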

Related

How do I add observations to an existing data frame column?

I have a data frame. Let's say it looks like this:
  a b c
1 2 5 6
2 3 1 4
3 1 2 7
I have simulated some values and put them into a vector c(4,5,8,8). I want to add these simulated values to columns a, b and c.
I have tried rbind and inserting the vector into the existing data frame, but that replaced the existing values with the simulated ones instead of adding the simulated values below the existing ones.
x <- data.frame("a" = c(2,3,1), "b" = c(5,1,2), "c" = c(6,4,7))
y <- c(4,5,8,8)
This is the output I expect to see:
  a b c
1 2 5 6
2 3 1 4
3 1 2 7
4 4 4 4
5 5 5 5
6 8 8 8
7 8 8 8
Help would be greatly appreciated. Thank you.
You can do:
as.data.frame(sapply(x, function(z) append(z, y)))
a b c
1 2 5 6
2 3 1 4
3 1 2 7
4 4 4 4
5 5 5 5
6 8 8 8
7 8 8 8
An option is assignment
n <- nrow(x)
x[n + seq_along(y), ] <- y
x
# a b c
#1 2 5 6
#2 3 1 4
#3 1 2 7
#4 4 4 4
#5 5 5 5
#6 8 8 8
#7 8 8 8
Another option is to replicate 'y' once per column and rbind:
rbind(x, `colnames<-`(replicate(ncol(x), y), names(x)))
x[(nrow(x)+1):(nrow(x)+length(y)),] <- y
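The approaches above all rely on y being repeated once per column. A minimal sketch that makes this step explicit, assuming every column should receive the same values (new_rows and out are hypothetical names introduced here):

```r
x <- data.frame(a = c(2, 3, 1), b = c(5, 1, 2), c = c(6, 4, 7))
y <- c(4, 5, 8, 8)

# Build the rows to append: one copy of y per column of x, same column names
new_rows <- as.data.frame(setNames(rep(list(y), ncol(x)), names(x)))

# rbind() stacks the new rows below the existing ones; x itself is unchanged
out <- rbind(x, new_rows)
```

Keeping the result in a separate object leaves the original x untouched, which can be convenient when the simulation is repeated.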

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only the data that is within 2 SD of the mean for each numeric column.
I know how to do it for a single column but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the line of code for keeping only the data that is within 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the columns of the dataset and subset the numeric vectors (using an if/else condition) based on the mean and sd.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT:
I was under the impression that we needed to remove the outliers for each column separately. But if we need to keep only the rows that have no outliers in any numeric column, we can loop through the columns with lapply as before; instead of returning 'x', we return the indices of the elements of 'x' that are within 2 SD (or all indices for non-numeric columns) and then take the intersection of the list elements with Reduce. The resulting numeric index can be used for subsetting the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
    seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
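An equivalent way to express the same row filter is to combine logical masks with `&` rather than intersecting index vectors. The following is only a sketch under that assumption; within_2sd is a hypothetical helper name, not part of the answer above.

```r
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c", header = TRUE)

# Hypothetical helper: TRUE where a value lies within 2 SD of its column mean
within_2sd <- function(v) abs((v - mean(v)) / sd(v)) <= 2

num_cols <- sapply(df, is.numeric)

# One logical vector per numeric column, combined row-wise with `&`
keep <- Reduce(`&`, lapply(df[num_cols], within_2sd))
filtered <- df[keep, ]
```

For the toy data only the first row (birds = 21) falls outside 2 SD, so filtered contains rows 2 to 8.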
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD (?)
In that case I would suggest creating two filters: the first indicates the numeric columns, and the second checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function:
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
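With its default arguments, scale() centers each column on its mean and divides by its standard deviation, so abs(scale(...)) is the matrix of absolute z-scores. A small sketch checking this against the manual formula (z_scale, z_manual and same are names introduced here for illustration):

```r
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c", header = TRUE)

num <- df[sapply(df, is.numeric)]

# z-scores from scale() with default center = TRUE, scale = TRUE
z_scale <- scale(num)

# The same z-scores computed column by column
z_manual <- sapply(num, function(v) (v - mean(v)) / sd(v))

# Compare values only; scale() attaches extra attributes to its result
same <- isTRUE(all.equal(z_scale, z_manual, check.attributes = FALSE))
```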

Extract data from data.frame based on coordinates in another data.frame

So here is my problem. I have a really big data.frame with two columns: the first one represents x coordinates (rows) and the other one y coordinates (columns), for example:
x y
1 1
2 3
3 1
4 2
3 4
In another frame I have some data (numbers actually):
a b c d
8 7 8 1
1 2 3 4
5 4 7 8
7 8 9 7
1 5 2 3
I would like to add a third column in first data.frame with data from second data.frame based on coordinates from first data.frame. So the result should look like this:
x y z
1 1 8
2 3 3
3 1 5
4 2 8
3 4 8
Since my data.frames are really big, for loops are too slow. I think there is a way to do this with the apply family, but I can't find how. Thanks in advance (and sorry for the ugly message layout; this is my first post here and I don't know how to produce the nice layout with code and proper data.frames seen in other questions).
This is a simple indexing question. There is no need for external packages or *apply loops, just do
df1$z <- df2[as.matrix(df1)]
df1
# x y z
# 1 1 1 8
# 2 2 3 3
# 3 3 1 5
# 4 4 2 8
# 5 3 4 8
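For reference, a self-contained sketch of the matrix-indexing idea, with df1 and df2 rebuilt from the tables in the question (idx is a name introduced here). A two-column numeric matrix used as an index selects one element per row of the index: the first column gives the row and the second gives the column.

```r
df1 <- data.frame(x = c(1, 2, 3, 4, 3), y = c(1, 3, 1, 2, 4))
df2 <- data.frame(a = c(8, 1, 5, 7, 1), b = c(7, 2, 4, 8, 5),
                  c = c(8, 3, 7, 9, 2), d = c(1, 4, 8, 7, 3))

# Each row of idx is a (row, column) pair into df2
idx <- as.matrix(df1[, c("x", "y")])

# Index the values as a matrix so that one element is picked per pair
df1$z <- as.matrix(df2)[idx]
```

Selecting the x and y columns by name keeps the index matrix at two columns even if df1 later gains extra columns such as z.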
A base R solution: (df1 and df2 are coordinates and numbers as data frames):
df1$z <- mapply(function(x,y) df2[x,y], df1$x, df1$y )
It works if the last y in the first data frame is corrected from 5 to 4.
I guess it was a typo, since you don't have 5 columns in the second data frame.
Here's how I would do this.
First, use data.table for fast merging; then convert your data frames (I'll call them dt1 with coordinates and vals with values) to data.tables.
library(data.table)
dt1 <- data.table(dt1)
vals <- data.table(vals)
Second, put vals into a new data.table with coordinates:
vals_dt <- data.table(x = rep(1:dim(vals)[1], dim(vals)[2]),
                      y = rep(1:dim(vals)[2], each = dim(vals)[1]),
                      z = unlist(vals, use.names = FALSE),  # values in column-major order
                      key = c("x", "y"))
Now merge:
setkey(dt1,x,y)[vals_dt,z:=z]
You can also try the data.table package and update df1 by reference:
library(data.table)
setDT(df1)[, z := df2[cbind(x, y)]][]
# x y z
# 1: 1 1 8
# 2: 2 3 3
# 3: 3 1 5
# 4: 4 2 8
# 5: 3 4 8

Comparing 3 column lists in R

I am looking for a loop in R which can compare three-column matrices. What I'd like the loop to do is take two matrices/data frames of the format
A:
X Y Z
3 4 5
3 5 6
4 5 7
and
B:
X Y Z
3 4 5
3 4 4
3 4 7
4 5 7
and loop through each of these returning the row if all of the 3 columns are exact matches.
This would return
X Y Z
3 4 5
4 5 7
Ideally the code could be applied to longer matrices/data frames.
You could try intersect from dplyr if "A" and "B" are data.frames:
library(dplyr)
intersect(A,B)
Or
inner_join(A,B)
An option in base R is merge:
merge(A,B, by = c("X", "Y", "Z"))
# X Y Z
#1 3 4 5
#2 4 5 7
This approach works the same way with matrices and data.frames.
And in fact, you could even leave out the specification of the "by" argument in this case:
merge(A,B)
# X Y Z
#1 3 4 5
#2 4 5 7
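A self-contained sketch of the merge approach, with A and B rebuilt from the question (common is a name introduced here). When by is omitted, merge joins on all column names the two inputs share, which here is X, Y and Z.

```r
A <- data.frame(X = c(3, 3, 4), Y = c(4, 5, 5), Z = c(5, 6, 7))
B <- data.frame(X = c(3, 3, 3, 4), Y = c(4, 4, 4, 5), Z = c(5, 4, 7, 7))

# Inner join on every shared column: keeps only rows present in both inputs
common <- merge(A, B)
```

Note that if either input contains duplicated rows, merge returns one row per matching pair, so wrapping the result in unique() may be needed to get a plain set intersection.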
If your data are big, you can use the data.table package:
library(data.table); setDT(A); setDT(B)
setkey(B)[A,nomatch=0]
# X Y Z
#1: 3 4 5
#2: 4 5 7
setkey(A)[B,nomatch=0] returns the same output since it is an intersection.

Sort and number within levels of a factor in R

If I have the following data frame G:
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
I am trying to get:
z type x y
3 a 6 3
2 a 5 2
1 a 4 1
4 b 1 2
5 b 0.9 1
6 c 4 1
I.e., I want to sort the whole data frame within the levels of the factor type based on vector x, get the length of each level (a = 3, b = 2, c = 1), and then number the rows in decreasing fashion in a new vector y.
My starting place is currently with sort()
tapply(y, x, sort)
Would it be best to first try and use sapply to split everything first?
There are many ways to skin this cat. Here is one solution using base R and vectorized code in two steps (without any apply):
1. Sort the data using order and xtfrm.
2. Use rle and sequence to generate the sequence.
Replicate your data:
dat <- read.table(text="
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
", header=TRUE, stringsAsFactors=FALSE)
Two lines of code:
r <- dat[order(dat$type, -xtfrm(dat$x)), ]
r$y <- sequence(rle(r$type)$lengths)
Results in:
r
z type x y
3 3 a 6.0 1
2 2 a 5.0 2
1 1 a 4.0 3
4 4 b 1.0 1
5 5 b 0.9 2
6 6 c 4.0 1
The call to order is slightly complicated. Since you are sorting one column in ascending order and a second in descending order, use the helper function xtfrm. See ?xtfrm for details, but it is also described in ?order.
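The result above numbers the rows 1, 2, ... within each group, whereas the question asks for a countdown from the group size to 1. One possible adjustment, offered as a sketch rather than as part of the original answer, is to compute the counter per group with ave and reverse it:

```r
dat <- read.table(text = "
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
", header = TRUE, stringsAsFactors = FALSE)

# Sort by type ascending, then by x descending
r <- dat[order(dat$type, -xtfrm(dat$x)), ]

# ave() applies the function within each level of type;
# rev(seq_along(v)) counts down from the group size to 1
r$y <- ave(r$x, r$type, FUN = function(v) rev(seq_along(v)))
```

With this change the largest x in each group receives the largest y, matching the layout requested in the question.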
I like Andrie's better:
dat <- read.table(text="z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4", header=TRUE)
Three lines of code:
dat <- dat[order(dat$type, -dat$x), ]  # sort by type, then by x in decreasing order
x <- by(dat, dat$type, nrow)
dat$y <- unlist(sapply(x, function(z) z:1))
I edited my response to address the comments Andrie mentioned. This works, but if you went this route instead of Andrie's you're crazy.
