Compare vector to a dataframe - r

I have a dataframe that looks something like -
test A B C
28 67 4 23
45 82 43 56
34 8 24 42
I need to compare test with the other three columns: for each row, I just need the number of values in the other columns that are less than the corresponding value in the test column.
So the desired output is -
test A B C result
28 67 4 23 2
45 82 43 56 1
34 8 24 42 2
When I tried -
comp_vec = "test"
name_vec = c("A", "B", "C")
rowSums(df[, comp_vec] > df[, name_vec])
I get the error -
Error in Ops.data.frame(df[, comp_vec], df[, name_vec]) :
‘>’ only defined for equally-sized data frames
I am looking for a way to do this without replicating test to match the size of the dataframe.

You can use sapply to compare the df$test column against each of the other three columns. That returns a TRUE/FALSE matrix, which you can pass to rowSums and assign as your result column.
df <- data.frame(test = c(28, 45, 34), A = c(67, 82, 8), B = c(4, 43, 24), C = c(23, 56, 42))
df$result <- rowSums(sapply(df[,2:4], function(x) df$test > x))
> df
test A B C result
1 28 67 4 23 2
2 45 82 43 56 1
3 34 8 24 42 2
I noticed your expected result has 82 for the second row of A, whereas it's 5 in your starting example.

df$result <- apply(df, 1, function(x) sum(x < x[1]))
Use apply, specify 1 to indicate by row. x < x[1] will give a vector of TRUE/FALSE if the value at each position in the row is smaller than the first column's value. Use sum to give the number of TRUE values.
# test A B C result
# 1 28 67 4 23 2
# 2 45 82 43 56 1
# 3 34 8 24 42 2
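For what it's worth, the same count can be had without sapply or apply. This is only a sketch, and it assumes test is pulled out as a plain vector with [[ ]] (the error in the question suggests df[, comp_vec] was still a data frame there, as it would be for a tibble). When a data frame is compared against a vector of length nrow(df), the vector is recycled down each column, so nothing has to be replicated by hand.

```r
df <- data.frame(test = c(28, 45, 34), A = c(67, 82, 8),
                 B = c(4, 43, 24), C = c(23, 56, 42))
comp_vec <- "test"
name_vec <- c("A", "B", "C")

# [[ ]] always returns a plain vector; it is recycled column by column
# against the three selected columns, giving a logical matrix.
is_smaller <- df[, name_vec] < df[[comp_vec]]

# Count the TRUE values in each row and store them as the result column.
df$result <- unname(rowSums(is_smaller))
df$result
```

The variable names comp_vec and name_vec are taken from the question; is_smaller is just an illustrative name.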


Add number to vector repeatedly and duplicate vector

I have two values,
3 and 5,
and I make a vector:
num1 <- 3
num2 <- 12
a <- c(num1, num2)
I want to add a number (12) to vector "a", and
I also want to build a new vector by repeating and appending,
like this:
3,12, 15,24, 27,36, 39,48 ....
The number of repeats "n" is 6.
I don't have any idea how to do this.
Here are two methods in base R.
With outer, you could do
c(outer(c(3, 12), (12 * 0:4), "+"))
[1] 3 12 15 24 27 36 39 48 51 60
or with sapply, you can explicitly loop through and calculate the pairs of sums.
c(sapply(0:4, function(i) c(3, 12) + (12 * i)))
[1] 3 12 15 24 27 36 39 48 51 60
outer returns a matrix in which every pair of elements from the two vectors has been added together. c is used to flatten it into a vector. sapply loops through 0:4 and calculates the element-wise sum for each value. It also returns a matrix in this instance, so c is again used to return a vector.
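To make the outer idea reusable, it could be wrapped in a small function. This is a rough sketch only: the name repeat_add and its arguments are made up for illustration, and n is assumed to mean the number of pairs in the output (including the original pair).

```r
# Hypothetical wrapper around the outer() idea; name and arguments are illustrative.
repeat_add <- function(vec, step, n) {
  # Each column of the matrix is vec shifted by 0, step, 2 * step, ...
  shifted <- outer(vec, step * (seq_len(n) - 1), "+")
  # c() flattens the matrix column by column, which keeps the pairs together.
  c(shifted)
}

result <- repeat_add(c(3, 12), 12, 4)
result
```

Because the matrix is flattened column by column, each shifted copy of the pair stays adjacent in the output.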
Here is a somewhat generic function that takes as input your original vector a, the number to add (12), and n:
f1 <- function(vec, x, n){
len1 <- length(vec)
v1 <- sapply(seq(n/len1), function(i) x*i)
v2 <- rep(v1, each = n/length(v1))
v3 <- rep(vec, n/len1)
return(c(vec, v3 + v2))
}
f1(a, 12, 6)
#[1] 3 12 15 24 27 36 39 48
f1(a, 11, 12)
#[1] 3 12 14 23 25 34 36 45 47 56 58 67 69 78
f1(a, 3, 2)
#[1] 3 12 6 15
EDIT
If by n=6 you mean 6 times the whole vector then,
f1 <- function(vec, x, n){
len1 <- length(vec)
v1 <- sapply(seq(n), function(i) x*i)
v2 <- rep(v1, each = len1)
v3 <- rep(vec, n)
return(c(vec, v3 + v2))
}
f1(a, 12, 6)
#[1] 3 12 15 24 27 36 39 48 51 60 63 72 75 84
Using rep for repeating and cumsum for the addition:
n = 6
rep(a, n) + cumsum(rep(c(12, 0), n))
# [1] 15 24 27 36 39 48 51 60 63 72 75 84

Move cells to new column at each 48th row

I have list with names in A1:A144 and I want to move A49:A96 to B1:B48 and A97:144 to C1:C48.
So for each 48th row, I want the next 48 rows moved to a new column.
How to do that?
If you want to consider a VBA alternative then:
Sub MoveData()
nF = 1
nL = 48
nSize = Cells(Rows.Count, "A").End(xlUp).Row
nBlock = nSize / nL
For k = 1 To nBlock
nF = nF + 48
nL = nL + 48
Range("A" & nF & ":A" & nL).Copy Cells(1, k + 1)
Range("A" & nF & ":A" & nL).ClearContents
Next k
End Sub
Not sure how scalable this solution is, but it does work.
First let's pretend your names are x and you want the solution to be in new.df
number.shifts <- ceiling(length(x) / 48) # work out how many columns we need
# create an empty (NA) matrix with the dimensions we need: 48 rows per column
new.df <- matrix(data = NA, nrow = 48, ncol = number.shifts)
# run a for-loop over x, moving to the next column after every 48th element
for (i in seq_along(x)){
j <- (i - 1) %/% 48 + 1 # target column: 1 for elements 1-48, 2 for 49-96, ...
r <- (i - 1) %% 48 + 1 # target row within that column
new.df[r, j] <- x[i]
}
I think you have to elaborate on your question a little more. Do you have the data in R or in Excel and do you want the output to be in R or in Excel?
That being said, if x is your vector indicating clusters
x <- rep(1:3, each = 48)
and y is the variable containing names or whatever that you want to distribute over columns A:C (each having 48 rows),
y <- sample(letters, 3 * 48, replace = TRUE)
you can do this:
y.wide <- do.call(cbind, split(y, x))
Just as there is stack in R to create a very long representation of a group of columns, there is unstack to take a long column and make it into a wide form.
Here's a basic example:
mydf <- data.frame(A = 1:144)
mydf$groups <- paste0("A", gl(n=3, k=48)) ## One of many ways to create groups
mydf2 <- unstack(mydf)
head(mydf2)
# A1 A2 A3
# 1 1 49 97
# 2 2 50 98
# 3 3 51 99
# 4 4 52 100
# 5 5 53 101
# 6 6 54 102
tail(mydf2)
# A1 A2 A3
# 43 43 91 139
# 44 44 92 140
# 45 45 93 141
# 46 46 94 142
# 47 47 95 143
# 48 48 96 144
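If the data is already in R as a plain vector, matrix() on its own may be enough, since it fills column by column. A small sketch, where the vector of names is a made-up stand-in and the padding step is only there in case the length is not a multiple of 48:

```r
# Hypothetical stand-in for the 144 names in column A.
x <- paste0("name", 1:144)
block <- 48

# Pad with NA so the length is a whole number of blocks.
n_cols <- ceiling(length(x) / block)
padded <- x
length(padded) <- block * n_cols

# matrix() fills column-wise: elements 1-48 go to column 1, 49-96 to column 2, ...
wide <- matrix(padded, nrow = block)
dim(wide)
```

Here wide[1, 2] holds the 49th name and wide[48, 3] the 144th, which matches the A49 to B1 and A144 to C48 layout asked for.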

Sample to have an equal number of each sex within groups in R

First things, first. Here are my data:
lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)
data <- data.frame(lat, long, sex, score)
The data should look like this:
lat long sex score
1 12 -14 M 2
2 12 -14 M 6
3 58 139 M 3
4 58 139 M 6
5 58 139 F 5
6 58 139 M 4
7 58 139 M 3
8 45 -68 F 2
9 45 -68 M 3
10 45 -68 M 9
11 45 -68 M 9
12 45 -68 F 8
13 45 1 M 6
14 64 1 F 5
15 64 1 M 6
16 64 1 F 7
17 64 1 F 5
18 64 1 F 7
19 64 1 F 5
20 64 1 M 1
I am at my wits' end trying to figure this one out. The variables are latitude, longitude, sex and score. I would like to have an equal number of males and females within each location (i.e. with the same longitude and latitude). For instance, the second location (rows 3 to 7) has only one female. This female should be retained and one male from the remaining individuals should also be retained (by random sampling, perhaps). Some locations have only information about one sex, e.g. the first location (rows 1 and 2) has only data on males. The rows from this location should be dropped (since there are no females). All going according to plan, the final dataset should look something like this:
lat2 long2 sex2 score2
1 58 139 F 5
2 58 139 M 4
3 45 -68 F 2
4 45 -68 M 3
5 45 -68 M 9
6 45 -68 F 8
7 64 1 M 6
8 64 1 F 5
9 64 1 F 7
10 64 1 M 1
Any help would be appreciated.
Here's a solution with lapply:
data[unlist(lapply(with(data, split(seq.int(nrow(data)), paste(lat, long))),
# 'split' splits the sequence of row numbers (indices) along the unique
# combinations of 'lat' and 'long'
# 'lapply' applies the following function to all sub-sequences
function(x) {
# which of the indices are for males:
male <- which(data[x, "sex"] == "M")
# which of the indices are for females:
female <- which(data[x, "sex"] == "F")
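# number of rows to keep per sex:
k <- min(length(male), length(female))
# sample from the indices of males (sample.int avoids sample() reading a single index n as 1:n):
s_male <- male[sample.int(length(male), k)]
# sample from the indices of females:
s_female <- female[sample.int(length(female), k)]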
# combine both sampled indices:
x[c(s_male, s_female)]
})), ]
# The function 'lapply' returns a list of indices which is transformed to a vector
# using 'unlist'. These indices are used to subset the original data frame.
The result:
lat long sex score
9 45 -68 M 3
11 45 -68 M 9
12 45 -68 F 8
8 45 -68 F 2
7 58 139 M 3
5 58 139 F 5
20 64 1 M 1
15 64 1 M 6
19 64 1 F 5
16 64 1 F 7
Below is a quick way to go about it, which involves creating a temporary column of the lat-long combination. We split the DF according to this column, count the M/F in each split, sample appropriately, then re-combine.
# First, We call the dataframe something other than "data" ;)
mydf <- data.frame(lat, long, sex, score)
# create a new data frame with a temporary column, which concatenates the lat & long.
mydf.new <- data.frame(mydf, latlong=paste(mydf$lat, mydf$long, sep=","))
# Split the data frame according to the lat-long location
mydf.splat <- split(mydf.new, mydf.new$latlong)
# eg, taking a look at one of our tables:
mydf.splat[[4]]
sampled <-
lapply(mydf.splat, function(tabl) {
Ms <- sum(tabl$sex=="M")
Fs <- sum(tabl$sex=="F")
if(Fs == 0 || Ms ==0) # If either is zero, we drop that location
return(NULL)
if(Fs == Ms) # If they are both equal, no need to sample.
return(tabl)
# If number of Females less than Males, return all Females
# and sample from Males in an amount equal to Females
if (Fs < Ms)
return(tabl[c(which(tabl$sex=="F"), sample(which(tabl$sex=="M"), Fs)), ])
if (Ms < Fs) # same as previous, but for Males < Females
return(tabl[c(which(tabl$sex=="M"), sample(which(tabl$sex=="F"), Ms)), ])
stop("hmmm... something went wrong.") ## We should never hit this line, but just in case.
})
# Flatten into a single table
mydf.new <- do.call(rbind, sampled)
# Clean up
row.names(mydf.new) <- NULL # remove the row names that were added
mydf.new$latlong <- NULL # remove the temporary column that we added
RESULTS
mydf.new
# lat long sex score
# 1 45 -68 F 2
# 2 45 -68 F 8
# 3 45 -68 M 9
# 4 45 -68 M 3
# 5 58 139 F 5
# 6 58 139 M 3
# 7 64 1 M 6
# 8 64 1 M 1
# 9 64 1 F 7
# 10 64 1 F 5
This returns the values as list elements:
spl <- split(data, interaction(data$lat, data$long) ,drop=TRUE)
# interaction creates all the two way pairs from those two vectors
# drop is needed to eliminate the dataframes with no representation
res <- lapply(spl, function(x) { # First find the number of each sex to select
N=min(table(factor(x$sex, levels=c("M", "F")))) # count both sexes, even an absent one; then sample each sex separately
rbind( x[ x$sex=="M" & row.names(x) %in% sample(row.names(x[x$sex=="M",] ), N) , ],
# One (or both) of these will be "sampling" all of that sex.
x[ x$sex=="F" & row.names(x) %in% sample(row.names(x[x$sex=="F", ]), N) , ] )
} )
res
#------------
$`45.-68`
lat long sex score
9 45 -68 M 3
11 45 -68 M 9
8 45 -68 F 2
12 45 -68 F 8
$`12.-14` # So there were no women in this group and zero could be matched
[1] lat long sex score
<0 rows> (or 0-length row.names)
$`45.1`
[1] lat long sex score
<0 rows> (or 0-length row.names)
$`64.1`
lat long sex score
15 64 1 M 6
20 64 1 M 1
16 64 1 F 7
17 64 1 F 5
$`58.139`
lat long sex score
7 58 139 M 3
5 58 139 F 5
... but if you wanted it as a dataframe, you can just use do.call(rbind, res):
> do.call(rbind, res)
lat long sex score
45.-68.10 45 -68 M 9
45.-68.11 45 -68 M 9
45.-68.8 45 -68 F 2
45.-68.12 45 -68 F 8
64.1.15 64 1 M 6
64.1.20 64 1 M 1
64.1.17 64 1 F 5
64.1.18 64 1 F 7
58.139.6 58 139 M 4
58.139.5 58 139 F 5
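The approaches above share one pattern: split the rows by location, work out the smaller of the two sex counts, and sample that many rows of each sex. Here is a compact sketch of that pattern. The helper name balance_sexes is invented for illustration, and it assumes the columns are called lat, long and sex with values "M" and "F", as in the question. It samples positions with sample.int, because sample(5, 1) draws from 1:5 rather than returning 5.

```r
lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)
data <- data.frame(lat, long, sex, score)

# Hypothetical helper: keep an equal number of "M" and "F" rows per location.
balance_sexes <- function(d) {
  # Row indices grouped by each lat/long combination that actually occurs.
  groups <- split(seq_len(nrow(d)), list(d$lat, d$long), drop = TRUE)
  keep <- lapply(groups, function(idx) {
    m <- idx[d$sex[idx] == "M"]
    f <- idx[d$sex[idx] == "F"]
    k <- min(length(m), length(f))  # 0 when one sex is missing, so the location drops out
    # Sample positions rather than values, so a single row index is never read as 1:n.
    c(m[sample.int(length(m), k)], f[sample.int(length(f), k)])
  })
  d[sort(unlist(keep, use.names = FALSE)), ]
}

set.seed(1)
balanced <- balance_sexes(data)
table(paste(balanced$lat, balanced$long), balanced$sex)
```

Which rows are drawn depends on the seed, but the table at the end should show the same count in the F and M columns for every location that is kept.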

How to transform a dataframe in an ordered matrix?

Please, input the following code:
A <- matrix(11, nrow = 4, ncol = 3)
A[,2] <- seq(119, 122, 1)
A[,3] <- seq(45, 42)
B <- matrix(39, nrow = 4, ncol = 3)
B[,2] <- seq(119, 122, 1)
B[,3] <- seq(35, 32)
C <- matrix(67, nrow = 4, ncol = 3)
C[,2] <- seq(119, 122, 1)
C[,3] <- seq(27, 24)
D <- rbind(A, B, C)
You will get D which is a 12 x 3 matrix; I would like to know the most efficient way to obtain Mat starting from D.
> Mat
11 39 67
119 45 35 27
120 44 34 26
121 43 33 25
122 42 32 24
In fact, Mat is the last column of D indexed by the first and the second column of D; e.g. consider Mat[1,1] which is equal to 45: it comes from the only row of D which is identified by 11 and 119.
How may I obtain it?
Thanks,
You can use xtabs:
xtabs(D[,3]~D[,2]+D[,1])
D[, 1]
D[, 2] 11 39 67
119 45 35 27
120 44 34 26
121 43 33 25
122 42 32 24
library(reshape2)
dcast(data.frame(D), X2 ~ X1, value.var = "X3")
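A base R alternative is tapply, which returns a plain matrix rather than a table or data frame. This is a sketch that assumes each combination of the first two columns occurs exactly once in D, so the choice of sum as the aggregating function does not change any value. D is rebuilt here with cbind only to keep the example short.

```r
# Rebuild D as in the question (cbind recycles the constant first column).
A <- cbind(11, 119:122, 45:42)
B <- cbind(39, 119:122, 35:32)
C <- cbind(67, 119:122, 27:24)
D <- rbind(A, B, C)

# Rows are indexed by column 2 of D, columns by column 1; cells come from column 3.
Mat <- tapply(D[, 3], list(D[, 2], D[, 1]), sum)
Mat
```

The dimnames of Mat are character strings, so a cell is looked up as Mat["119", "11"].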

How to look up field name from an aggregate result

I am trying to look up the index or name of a data frame column based on the maximum of the aggregate values of that data frame, for example:
df <- data.frame(
id = 1:6,
v1 = c(3, 20, 34, 23, 23, 56),
v2 = c(1, 3, 4, 10, 30, 40),
v3 = c(20, 35, 60, 60, 70, 80))
id v1 v2 v3
1 1 3 1 20
2 2 20 3 35
3 3 34 4 60
4 4 23 10 60
5 5 23 30 70
6 6 56 40 80
> colSums(as.data.frame(df[[1]]))
df[[1]]
21
> colSums(as.data.frame(df[[2]]))
df[[2]]
159
> colSums(as.data.frame(df[[3]]))
df[[3]]
88
So, for example, the maximum result using colSums is 159, and I'm trying to figure out how to return 'df[[2]]'.
First, you can simply run colSums directly on your data.frame
> colSums(df)
id v1 v2 v3
21 159 88 325
Subsetting is easy too
> df[which.max(colSums(df))]
v3
1 20
2 35
3 60
4 60
5 70
6 80
Or, if you just want the index, as implied in your first line:
> which.max(colSums(df))
v3
4
Also note that if you expect there might be more than one column with the same maximum sum, and you want to return all of them, you can use which(colSums(df) == max(colSums(df))) instead of which.max, which only returns the first occurrence.
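If what you want is the column name rather than the column or its position, names() can be combined with either approach. A short sketch, assuming the id column should be left out of the comparison (that exclusion is my assumption, not something stated in the question):

```r
df <- data.frame(
  id = 1:6,
  v1 = c(3, 20, 34, 23, 23, 56),
  v2 = c(1, 3, 4, 10, 30, 40),
  v3 = c(20, 35, 60, 60, 70, 80))

# Sum only the value columns; id is excluded by assumption.
sums <- colSums(df[, c("v1", "v2", "v3")])

# Name of the first column with the largest sum.
best <- names(sums)[which.max(sums)]

# Names of all columns that share the largest sum (handles ties).
tied <- names(sums)[sums == max(sums)]
best
```

With this data both best and tied come out as v3, since its sum of 325 is the largest.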
