Sorry, my question is probably not so clear because I cannot formulate it well. I will explain by example.
I have two data frames, df and df1:
df <- data.frame(a = c(25,15,35,45,2))
df1 <- data.frame(b = c(28,25,24,43,10))
I want to merge the two data frames on the condition that the values are within ±5 of each other, and create a column distance. For example, the first element in column a is 25; I want to compare 25 with all elements in column b and keep only the values within ±5 of 25. The output should look like:
 a  b distance
25 28        3
   24        1
   25        0
15 10        5
45 43        2
Values which are not within ±5 of any value in b, like 2 and 35, should be excluded.
We may use outer to create a logical matrix, then get the row/column indices with which and arr.ind = TRUE. Use the indices to subset the 'a' and 'b' columns from the corresponding data frames and compute the absolute difference:
i1 <- which(outer(df$a, df1$b, FUN = function(x, y)
  abs(x - y) <= 5), arr.ind = TRUE)
transform(data.frame(a = df$a[i1[, 1]], b = df1$b[i1[, 2]]), distance = abs(a - b))
-output
a b distance
1 25 28 3
2 25 25 0
3 25 24 1
4 45 43 2
5 15 10 5
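For comparison, here is a hedged base R sketch (not part of the answer above) that reaches the same result by cross-joining the two data frames with merge(by = NULL) and filtering on the tolerance; row order may differ:
# cross-join all 5 x 5 combinations, then keep pairs within the +-5 tolerance
cj <- merge(df, df1, by = NULL)
cj$distance <- abs(cj$a - cj$b)
subset(cj, distance <= 5)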
To fill an empty column of a data frame based on a condition that takes another column into account, I have found the following solution, which works fine but is somewhat ugly. Does anybody know a more elegant way to solve this?
base::set.seed(123)
test_df <- base::data.frame(vec1 = base::sample(base::seq(1, 100, 1), 50),
                            vec2 = base::seq(1, 50, 1),
                            vec3 = NA)
for (a in 1:base::nrow(test_df)){
  spc_test_df <- test_df[a, ]
  # select the specific row of the dataframe
  if(spc_test_df$vec1 <= 25 | spc_test_df$vec1 >= 75){
    # evaluate whether the value is at or beyond the thresholds
    spc_test_df$vec3 <- 1
    # if so, write 1
  } else {
    spc_test_df$vec3 <- 0
    # if not, write 0
  }
  test_df[a, ] <- spc_test_df
  # write the specific row back to the dataframe
}
There is no need for a for-loop, as you can use vectorized solutions in this case. Three options to solve this problem:
# option 1
test_df$vec3 <- +(test_df$vec1 <= 25 | test_df$vec1 >= 75)
# option 2
test_df$vec3 <- as.integer(test_df$vec1 <= 25 | test_df$vec1 >= 75)
# option 3
test_df$vec3 <- ifelse(test_df$vec1 <= 25 | test_df$vec1 >= 75, 1, 0)
which in all cases gives:
vec1 vec2 vec3
1 5 1 1
2 6 2 1
3 61 3 0
4 20 4 1
....
47 3 47 1
48 55 48 0
49 44 49 0
50 97 50 1
(only the first and last four rows presented)
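As a side note, the unary + in option 1 works because it coerces a logical vector to integer:
+(c(TRUE, FALSE, TRUE))
# [1] 1 0 1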
I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table.
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns in a group, take all of that group's columns).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
# [1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
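As a quick check (my own sketch; the seed is only there to make the draw reproducible):
set.seed(1)
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
table(names(dframe)[keep])
# A B C
# 7 7 6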
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
               lapply(unique(colnames(dframe)),
                      function(x){
                        cols <- which(colnames(dframe) == x)
                        dframe[, if(length(cols) <= nc) cols
                               else sample(cols, nc, replace = FALSE)]
                      }))
It might look complicated, but it really just takes all columns of a group if there are at most nc of them, and samples nc random columns if there are more.
And to restore your original column-name scheme, gsub does the trick (the dot must be escaped so that only the numeric suffix added during cbind is removed):
colnames(res) <- gsub('\\.[0-9]+$', '', colnames(res))
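For instance, subsetting a data frame with duplicated names makes them unique as A, A.1, A.2, ..., and the escaped pattern strips exactly that numeric suffix:
gsub('\\.[0-9]+$', '', c("A", "A.1", "B.12"))
# [1] "A" "A" "B"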
Given two data frames s and q with five observations each:
set.seed(8)
s <- data.frame(id=sample(c('Z','X'), 5, T),
t0=sample(1:10, 5, T),
t1 = sample(11:30, 5, T))
q <- data.frame(id=sample(c('Z','X'), 5, T),
t0=sample(1:10, 5, T),
t1 = sample(11:30, 5, T))
> s
id t0 t1
1 Z 8 20
2 Z 3 12
3 X 10 19
4 X 8 21
5 Z 7 13
> q
id t0 t1
1 X 3 30
2 Z 5 12
3 Z 7 23
4 Z 3 21
5 X 7 27
The midpoint for the observations between the variables t0 and t1 is (e.g. for s data):
s$t0+(s$t1-s$t0)/2
To find the index of the (first) observation in s whose midpoint is closest to, say, the first observation in q I can do:
i <- which.min(abs(s$t0 + (s$t1 - s$t0)/2 - (q$t0[1] + (q$t1[1] - q$t0[1])/2)))
s[i,]
gives:
id t0 t1
3 X 10 19
But I cannot figure out how to find the same index in the original data s if I also want to condition on the id variable (pseudo-code like which.min(....) & s$id == q$id[1]; in this case the midpoint is sought among rows whose id is 'X'). This SO question is close but not spot on.
Again: I need an index to be used in the original 5-row data set.
Set the which.min argument to infinity when your condition is not obeyed:
val <- abs(s$t0 + (s$t1 - s$t0)/2 - (q$t0[1] + (q$t1[1] - q$t0[1])/2))
val[s$id != q$id[1]] <- Inf
i <- which.min(val)
By the way, you can simplify the expression in the first line as:
val <- abs((s$t0+s$t1)/2-(q$t0[1]+q$t1[1])/2)
or even
val <- abs(s$t0+s$t1-q$t0[1]-q$t1[1])/2
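If you need such an index for every row of q (my own extension, not part of the question), the same idea can be wrapped in sapply:
closest <- sapply(seq_len(nrow(q)), function(j) {
  val <- abs(s$t0 + s$t1 - q$t0[j] - q$t1[j]) / 2
  val[s$id != q$id[j]] <- Inf
  which.min(val)  # index into the original 5-row s
})
closest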
I have a data frame of 14 columns and thousands of rows. I want to count or select rows where the value in column 1 is 0 and the values in the other 13 columns are greater than 0, then count rows where the value in the second column is 0 and the values in the other 13 columns are greater than 0, and so on for all 14 columns.
Any hint on how to do that?
Many thanks
Try this. The first line creates sample data with replicate, and the second computes the counts based on your logical condition:
df <- data.frame(replicate(14, sample(0:5, 1000, replace = T)))
result <- sapply(1:14, function(i) sum(df[, i] == 0 & apply(df[-i] > 0, 1, all)))
names(result) <- paste0("Col_", 1:14)
result
Col_1 Col_2 Col_3 Col_4 Col_5 Col_6 Col_7 Col_8 Col_9 Col_10 Col_11 Col_12 Col_13 Col_14
12 12 19 15 18 20 19 13 19 15 12 17 15 18
Are you aware of the function apply? If you write a function that reads a vector of length 14 and outputs TRUE or FALSE depending on whether the vector satisfies the requirement, then you can use apply to apply this function to all rows of the data frame. This yields a vector of thousands of TRUEs and FALSEs that can be used for selecting or counting (the latter by simply passing the vector to sum).
Example:
cow <- function(colnr, x){ # colnr is the number of the column you want to be zero; x is a vector of length 14
  all(x[-colnr] > 0) & x[colnr] == 0
}
horse <- function(colnr){ # produces a sequence of TRUEs and FALSEs telling you which rows satisfy the condition
  apply(yourdataframe, 1, function(x) cow(colnr, x))
}
# example output:
horse(1)
# while we're at it: create a vector of length 14 containing the number of rows satisfying the 14 conditions:
sapply(1:14, function(colnr) sum(horse(colnr)))
The 1 in apply is there because you want to apply the function to rows, not columns. The function sapply is like apply, but applies a function to each element of a vector rather than each row of a data frame.
Update: this answer is the same as zyurnaidi's, which appeared while I was typing.
Using the sample data from zyurnaidi you can do this.
Find all 0 values in your data frame using which with arr.ind = TRUE, then remove the rows that contain a 0 in more than one column (the duplicated row indices) and count the occurrences per column:
set.seed(1234)
df <- data.frame(replicate(14, sample(0:5, 1000, replace = T)))
a <- which(df == 0, arr.ind = T)
table(a[ !(duplicated(a[, 1]) | duplicated(a[, 1], fromLast=T)), 2])
1 2 3 4 5 6 7 8 9 10 11 12 13 14
18 26 19 14 11 20 21 10 24 21 15 11 22 11
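An even shorter sketch, equivalent here because the sample values are non-negative (so "greater than 0" means "not 0"): flag the rows containing exactly one zero and count those zeros per column:
colSums(df == 0 & rowSums(df == 0) == 1)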
I'm trying to find a loop that replaces NAs with designated values.
Say I have a data frame as follows (I actually have more rows):
a<-c(18,NA,12,33,32,14,15,55)
b<-c(18,30,12,33,32,14,15,NA)
c<-c(16,18,17,45,22,10,24,11)
d<-c(16,18,17,42,NA,10,24,11)
data<- data.frame(rbind(a,b,c,d))
names(data) <- 1:8
All rows in my data frame come in pairs (rows 1 and 2 are the first pair, rows 3 and 4 the second, and so on).
I wish to replace each NA with the corresponding value from the other row of its pair, i.e. the NA in the first pair should become 30. Similarly, the NA in the 4th row should become 22.
Is there a loop I can carry out to treat every 2 rows as a pair and replace any NA found with the corresponding value in the same pair?
I'd use R's built-in vectorisation to find and replace NAs with the appropriate values. It seems you want to replace from the row below when a row is odd-numbered, and from the row above when it is even-numbered...
# Locate NAs in data
nas <- which( is.na( data ) , arr.ind = TRUE )
# row col
#a 1 2
#d 4 5
#b 2 8
# Where to get replacement value from: below on odd rows and value above on even rows
rows <- nas[,1] %% 2
rows[ rows == 0 ] <- -1
repl <- cbind( ( nas[,1] + rows ) , nas[ ,2] )
# Do replacement
data[ nas ] <- data[ repl ]
# 1 2 3 4 5 6 7 8
#a 18 30 12 33 32 14 15 55
#b 18 30 12 33 32 14 15 55
#c 16 18 17 45 22 10 24 11
#d 16 18 17 42 22 10 24 11
I'm sure the creation of the replacement locations matrix could be a little cleaner, but this should be fast as it only uses vectorised operations.
Sure -- this does the trick:
for(i in 1:nrow(data)) {
  missing <- which(is.na(data[i, ]))
  if(i %% 2) {
    data[i, missing] <- data[(i+1), missing]
  } else {
    data[i, missing] <- data[(i-1), missing]
  }
}
It allows for missing observations in both the top and bottom row of each pair, and where there is a gap it fills in with the observation from the same column location in the other part of the pair.
Note there's no error checking or other niceties, so this is pretty raw.
Also, if they are truly pairs of data, there are better means of joining your observations than just sticking them all into a dataframe.
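For completeness, a loop-free sketch in the same spirit (my own, assuming rows always come in consecutive odd/even pairs):
odd    <- seq(1, nrow(data), by = 2)
top    <- data[odd, ]      # first row of each pair
bottom <- data[odd + 1, ]  # second row of each pair
top[is.na(top)]       <- bottom[is.na(top)]  # fill odd-row NAs from the row below
bottom[is.na(bottom)] <- top[is.na(bottom)]  # fill even-row NAs from the row above
data[odd, ]     <- top
data[odd + 1, ] <- bottom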