I have a data frame of 14 columns and thousands of rows. I want to count or select rows where the value in column 1 is 0 and more than 0 in the other 13 columns, then count the rows where the value is 0 in the second column and more than 0 in the other 13 columns, and so on for all 14 columns.
Any hint on how to do that?
Many thanks
Try this. The first line replicates the data, and the second computes the counts based on your logical condition:
df <- data.frame(replicate(14, sample(0:5, 1000, replace = TRUE)))
# for each column i, count rows where column i is 0 and every other column is > 0
result <- sapply(1:14, function(i) sum(df[, i] == 0 & apply(df[-i] > 0, 1, all)))
names(result) <- paste0("Col_", 1:14)
result
Col_1 Col_2 Col_3 Col_4 Col_5 Col_6 Col_7 Col_8 Col_9 Col_10 Col_11 Col_12 Col_13 Col_14
12 12 19 15 18 20 19 13 19 15 12 17 15 18
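Since the values are non-negative, a row satisfies the condition for column i exactly when it contains a single zero and that zero sits in column i. If speed matters, here is a fully vectorized sketch of the same count, assuming the df built above (result2 is just an illustrative name):
z <- df == 0                           # logical matrix marking every zero
# rows with exactly one zero qualify; colSums counts them per column
result2 <- colSums(z & rowSums(z) == 1)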
Are you aware of the function apply? If you write a function that takes a vector of length 14 and outputs TRUE or FALSE depending on whether the vector satisfies the requirement, you can use apply to apply this function to all rows of the data.frame, yielding a vector of thousands of TRUEs and FALSEs that can be used for selecting or counting (the latter by simply passing the vector to sum).
Example:
cow <- function(colnr, x) { # colnr is the column you want to be zero, x is a row vector of length 14
  all(x[-colnr] > 0) & x[colnr] == 0
}
horse <- function(colnr) { # logical vector marking which rows satisfy the condition for column colnr
  apply(yourdataframe, 1, function(x) cow(colnr, x))
}
#example output:
horse(1)
#while we're at it: create a vector of length 14 containing the number of rows satisfying each of the 14 conditions:
sapply(1:14, function(i) sum(horse(i)))
The 1 in apply is because you want to apply the function to rows, not columns. The function sapply is similar, but applies a function to each element of a vector rather than to each row of a data frame.
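For instance, a tiny illustration with a toy matrix (purely for demonstration):
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)               # margin 1: one sum per row
apply(m, 2, sum)               # margin 2: one sum per column
sapply(1:3, function(i) i^2)   # per-element: 1 4 9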
Update: this answer is the same as zyurnaidi's, which appeared while I was typing.
Using the sample data from zyurnaidi you can do this.
Find all 0 values in your data frame using which with arr.ind = TRUE, then drop the rows that contain more than one zero (duplicated row indices) and count the occurrences per column:
set.seed(1234)
df <- data.frame(replicate(14, sample(0:5, 1000, replace = T)))
a <- which(df == 0, arr.ind = TRUE)  # row/column position of every zero
# keep only the rows with exactly one zero, then tabulate those zeros by column
table(a[!(duplicated(a[, 1]) | duplicated(a[, 1], fromLast = TRUE)), 2])
1 2 3 4 5 6 7 8 9 10 11 12 13 14
18 26 19 14 11 20 21 10 24 21 15 11 22 11
Sorry, my question is probably not clear because I could not formulate it well. I will explain by example.
I have two data frames, df and df1:
df <- data.frame(a = c(25,15,35,45,2))
df1 <- data.frame(b = c(28,25,24,43,10))
I want to merge the two data frames, keeping pairs of values that are within ±5 of each other, and create a column distance. For example, the first element in column a is 25; I want to compare 25 with all elements in column b and select only the values within 25 ± 5. The output should look like:
 a  b distance
25 28        3
   24        1
   25        0
15 10        5
45 43        2
Values that do not match anything within ±5, like 2 and 35, should be excluded.
We may use outer to create a logical matrix, get the row/column indices with which and arr.ind = TRUE, use the indices to subset the 'a' and 'b' columns from the corresponding datasets, and compute the difference:
i1 <- which(outer(df$a, df1$b, FUN = function(x, y)
abs(x - y) <=5), arr.ind = TRUE)
transform(data.frame(a = df$a[i1[,1]], b = df1$b[i1[,2]]), distance = abs(a - b))
Output:
a b distance
1 25 28 3
2 25 25 0
3 25 24 1
4 45 43 2
5 15 10 5
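As an aside, if a package dependency is acceptable, the fuzzyjoin package can do the pairing and the distance column in one call; a hedged sketch (check the by/max_dist arguments against your installed version):
library(fuzzyjoin)
# join rows whose values differ by at most 5 and record the absolute difference
difference_inner_join(df, df1, by = c("a" = "b"), max_dist = 5,
                      distance_col = "distance")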
I have a large data frame of 5520 by 5520. After every 120 rows I need to insert another row. The values of these new rows are contained in data frames of one row by 5520.
Using rbind, I can only add these rows at the end of the table, which I do not want, and it gives me an error:
fabioN2 <- rbind(fabioN2, auf2[1,])
Error in match.names(clabs, names(xi)) : names do not match previous names
Using tibble with add_row I also get an error:
> fabioN2 %>% add_row(fabioN2, auf2[1,], .after = 120)
Error: New rows can't add columns.
x Can't find columns fabioN2, X1, X2, X3, X4, and 5516 more in .data.
With fabioN2 being the large dataframe and auf2 containing the values I want to add to fabioN2.
Undoubtedly the code is wrong; based on the errors, I have to match the column names of both data frames, something I want to avoid given the 5520 different column names.
Anyone know how to easily add these dataframes at the desired spots?
I hope I got the logic right for your problem... I did it for a data.frame of 30 rows, inserting a row every 10 rows (as 120 is too much for a reproducible example in terms of fitting the output in the answer).
library(dplyr)
r <- 3 # your number is 46 (5520/120)
l <- 10 # your number is 120
# your long data.frame into which a row is inserted every l rows
df1 <- data.frame(dx = rep(c("a", "c", "e"), each = 10))
# your one-row data.frame to insert every l rows
df2 <- data.frame(dy = c("X"))
# set colnames to be identical
names(df2) <- colnames(df1)
# use the row number as ID and offset it as needed with the help of integer division
dff1 <- df1 %>%
dplyr::mutate(ID = dplyr::row_number()) %>%
dplyr::mutate(ID = ID + (ID-1) %/% l)
# repeat your one-row df as many times as needed and compute its ID with the offset calculation
dff2 <- df2 %>%
dplyr::slice(rep(row_number(), r)) %>%
dplyr::mutate(ID = dplyr::row_number()) %>%
dplyr::mutate(ID = (ID) * l + ID)
# union both data.frames (I am supposing column types are identical!)
dff1 %>%
dplyr::union(dff2) %>%
dplyr::arrange(ID)
dx ID
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 a 7
8 a 8
9 a 9
10 a 10
11 X 11
12 c 12
13 c 13
14 c 14
15 c 15
16 c 16
17 c 17
18 c 18
19 c 19
20 c 20
21 c 21
22 X 22
23 e 23
24 e 24
25 e 25
26 e 26
27 e 27
28 e 28
29 e 29
30 e 30
31 e 31
32 e 32
33 X 33
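One design note: dplyr::union() silently drops duplicate rows, so if the large data frame could contain rows identical to an inserted one, binding is safer. A hedged drop-in replacement for the last step:
# bind_rows keeps duplicates, unlike union()
dplyr::bind_rows(dff1, dff2) %>%
  dplyr::arrange(ID)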
I want to exclude the minimum as well as the maximum value of each row in a data frame. (If one of those values is repeated, only one occurrence should be excluded.)
I can exclude either the minimum or the maximum, but not both.
I don't seem to find a way to combine those (which both work fine by themselves):
d[-which(d == min(d))[1]]
d[-which(d == max(d))[1]]
This doesn't work:
d[
-which(d == min(d))[1] &
-which(d == max(d))[1]
]
It gives the full row.
(I also tried an approach using apply(d, 1, min/max), but this also fails.)
Update
Remembered after looking at @Rich Pauloo's answer: we can directly use which.max and which.min to get the indices of the maximum and minimum values.
as.data.frame(t(apply(df, 1, function(x) x[-c(which.max(x), which.min(x))])))
# V1 V2 V3
#1 13 11 6
#2 15 8 18
#3 5 10 21
#4 14 12 17
#5 19 9 20
Here which.max/which.min ensure that you get the index of the first maximum and minimum, respectively, for each row.
Some other variations could be
as.data.frame(t(apply(df, 1, function(x)
x[-c(which.max(x == min(x)), which.max(x == max(x)))])))
If you want to use which, we can do
as.data.frame(t(apply(df, 1, function(x)
  x[-c(which(x == min(x))[1], which(x == max(x))[1])])))
(Note the placement of [1]: it must select the first index returned by which; subsetting min(x) or max(x) with [1] is a no-op, so which would then return every matching index.)
data
set.seed(1234)
df <- as.data.frame(matrix(sample(25), 5, 5))
df
# V1 V2 V3 V4 V5
#1 3 13 11 16 6
#2 15 1 8 25 18
#3 24 5 4 10 21
#4 14 12 17 2 22
#5 19 9 20 7 23
You were very close! With data.frames you need to use a comma within the brackets to accomplish row-column subsetting.
1. Use which.max() and which.min() to return the index of the max and min values of a vector, respectively.
2. Bind those indices into a new vector with c().
3. Use - and the vector from 2. to subset your data frame for the desired rows.
Here's an example to copy/paste:
d <- data.frame(a = 1:5) # make example data.frame
d[-c(which.max(d$a), which.min(d$a)), ]
[1] 2 3 4
This will remove the rows containing the min and max values of score as shown in the example data frame.
library(tidyverse)
df <- tribble(~name, ~score,
'John', 10,
'Mike', 2,
'Mary', 11,
'Jane', 1,
'Jill', 5)
df %>%
arrange(score) %>%
slice(-1, -nrow(.))
# A tibble: 3 x 2
name score
<chr> <dbl>
1 Mike 2
2 Jill 5
3 John 10
We can use
t(apply(df, 1, function(x) x[!x %in% range(x)]))
One caveat: range(x) matches every occurrence of the minimum and maximum, so if either value is repeated in a row, more than two elements are dropped, the rows end up with unequal lengths, and apply returns a list instead of a matrix. The which.max/which.min solutions above remove exactly one of each.
I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns in a group, take all of that group's columns).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
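As a quick sanity check, tabulating the names of the kept columns should always report 7, 7 and 6 columns for the three groups, whatever the random draw:
table(names(dframe)[keep])
# A B C
# 7 7 6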
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
  lapply(unique(colnames(dframe)), function(x) {
    cols <- which(colnames(dframe) == x)
    # keep every column of the group if it has at most nc columns,
    # otherwise draw nc of them at random without replacement
    if (length(cols) <= nc) dframe[, cols] else dframe[, sample(cols, nc, replace = FALSE)]
  }))
It might look complicated, but it just takes all of a group's columns when there are at most nc of them, and samples nc columns at random when there are more.
And to restore your original column-name scheme, gsub does the trick (escape the dot so only the numeric suffixes appended during binding are removed):
colnames(res) <- gsub('\\.[[:digit:]]+', '', colnames(res))
I'm trying to find a loop that replaces NAs with designated values.
Say I have a data frame as follows (I actually have more rows):
a<-c(18,NA,12,33,32,14,15,55)
b<-c(18,30,12,33,32,14,15,NA)
c<-c(16,18,17,45,22,10,24,11)
d<-c(16,18,17,42,NA,10,24,11)
data<- data.frame(rbind(a,b,c,d))
names(data) <- 1:8
All rows in my data frame are in pairs (row[1] and [2] are the first pair, row[3] and [4] are the second and so on).
I wish to replace each NA with the corresponding value from its pair, i.e. replace the NA in the first pair with 30. Similarly, replace the NA in the 4th row with 22.
Is there a loop I can carry out to treat each 2 rows as a pair and replace any NAs found by its corresponding value in the same pair?
I'd use R's built-in vectorisation to find and replace NAs with the appropriate value. It seems you want to replace with the value from the row below when a row is odd-numbered, and from the row above when it is even-numbered...
# Locate NAs in data
nas <- which( is.na( data ) , arr.ind = TRUE )
# row col
#a 1 2
#d 4 5
#b 2 8
# Where to get replacement value from: below on odd rows and value above on even rows
rows <- nas[,1] %% 2
rows[ rows == 0 ] <- -1
repl <- cbind( ( nas[,1] + rows ) , nas[ ,2] )
# Do replacement
data[ nas ] <- data[ repl ]
# 1 2 3 4 5 6 7 8
#a 18 30 12 33 32 14 15 55
#b 18 30 12 33 32 14 15 55
#c 16 18 17 45 22 10 24 11
#d 16 18 17 42 22 10 24 11
I'm sure the creation of the replacement locations matrix could be a little cleaner, but this should be fast as it only uses vectorised operations.
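For what it's worth, one slightly cleaner sketch of the replacement-location matrix, mapping each row index to its partner within the pair (1↔2, 3↔4, ...):
# partner row: the next row for odd indices, the previous row for even ones
partner <- ifelse(nas[, 1] %% 2 == 1, nas[, 1] + 1, nas[, 1] - 1)
repl <- cbind(partner, nas[, 2])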
Sure -- this does the trick:
for (i in 1:nrow(data)) {
  missing <- which(is.na(data[i, ]))
  if (i %% 2) {
    data[i, missing] <- data[(i + 1), missing]
  } else {
    data[i, missing] <- data[(i - 1), missing]
  }
}
It allows for missing observations in both the top and bottom rows of each pair; where there is a gap, it fills in the observation from the same column in the other row of the pair.
Note there's no error checking or other niceties, so this is pretty raw.
Also, if they are truly pairs of data, there are better means of joining your observations than just sticking them all into a dataframe.
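To sketch that idea while staying close to this data (the pair id and the ave() fill here are illustrative, not from the original post): give each pair an explicit id, then fill NAs within each group.
# hypothetical sketch: an explicit pair id makes the pairing a first-class grouping
pair <- rep(seq_len(nrow(data) / 2), each = 2)
for (j in seq_along(data)) {
  data[[j]] <- ave(data[[j]], pair, FUN = function(v) {
    v[is.na(v)] <- v[!is.na(v)][1]  # copy the partner's value into any NA
    v
  })
}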