columnwise sum matching values to another column - r

It seems I am missing something here.
I have a data frame:
df<-data.frame(w=sample(1:3,10, replace=T), x=sample(1:3,10, replace=T), y=sample(1:3,10, replace=T), z=sample(1:3,10, replace=T))
> df
w x y z
1 3 1 1 3
2 2 1 1 3
3 1 3 2 2
4 3 1 3 1
5 2 2 1 1
6 1 2 2 3
7 1 2 2 2
8 2 2 2 3
9 1 3 3 3
10 2 2 1 1
I want to count, for each column, the rows whose value matches the 1st column.
sum(df$w==df$x)
[1] 3
sum(df$w==df$y)
[1] 2
sum(df$w==df$z)
[1] 1
I know that with apply I can do row-wise or column-wise operations.
apply(df,2,length)
w x y z
10 10 10 10
How do I combine these two functions?

Try colSums
colSums(df[-1] == df[, 1])
# x y z
# 3 2 1
Or, if you are into *apply loops, you could try
vapply(df[-1], function(x) sum(x == df[, 1]), double(1))
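To see why both versions work, here is a minimal sketch that rebuilds the data frame printed in the question from literal values (the original was drawn with sample(), so it cannot be reproduced without a seed). Comparing df[-1] with a vector of length nrow(df) compares every remaining column element-wise against w and yields a logical matrix; colSums then counts the TRUE values per column.

```r
# Literal copy of the data frame printed in the question
df <- data.frame(
  w = c(3, 2, 1, 3, 2, 1, 1, 2, 1, 2),
  x = c(1, 1, 3, 1, 2, 2, 2, 2, 3, 2),
  y = c(1, 1, 2, 3, 1, 2, 2, 2, 3, 1),
  z = c(3, 3, 2, 1, 1, 3, 2, 3, 3, 1)
)

# Logical matrix: TRUE where a column agrees with w in that row
matches <- df[-1] == df$w

# Column-wise count of matching rows
by_colsums <- colSums(matches)

# Same counts with an explicit per-column function
by_vapply <- vapply(df[-1], function(col) sum(col == df$w), numeric(1))

print(by_colsums)
```

Both results carry the column names x, y and z, so they can be used directly as a named vector.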

Related

How to keep only first value in every sequence of duplicated values in R [duplicate]

This question already has answers here:
Select first row in each contiguous run by group
(4 answers)
Closed 5 months ago.
I am trying to create a subset where I keep the first value in each sequence of numbers in a column. I tried to use:
df %>% group_by(x) %>% slice_head(n = 1)
But it only keeps the first occurrence of each value, not the first row of every sequence.
An example data where x column contains the repeated sequence can be seen below:
x = c(2,2,2,3,3,3,1,1,1,5,5,5,2,2,2,1,1,1,3,3,3)
y = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
df= data.frame(x,y)
> df
x y
1 2 1
2 2 1
3 2 1
4 3 1
5 3 1
6 3 1
7 1 1
8 1 1
9 1 1
10 5 1
11 5 1
12 5 1
13 2 1
14 2 1
15 2 1
16 1 1
17 1 1
18 1 1
19 3 1
20 3 1
21 3 1
So the end result that I would like to achieve is:
x = c(2,3,1,5,2,1,3)
y = c(1,1,1,1,1,1,1)
df= data.frame(x,y)
> df
x y
1 2 1
2 3 1
3 1 1
4 5 1
5 2 1
6 1 1
7 3 1
Could you please help or point me to any useful existing topics as I haven't managed to find it?
Thanks
You can try rleid from package data.table
> library(data.table)
> setDT(df)[!duplicated(rleid(x))]
x y
1: 2 1
2: 3 1
3: 1 1
4: 5 1
5: 2 1
6: 1 1
7: 3 1
Base R.
df[c(1, diff(df$x)) != 0, ]
Or also with helper functions from data.table.
library(data.table)
df[rowid(rleid(df$x)) == 1L, ]
# x y
# 1 2 1
# 4 3 1
# 7 1 1
# 10 5 1
# 13 2 1
# 16 1 1
# 19 3 1
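Using rle and match. Note that match returns the first row holding each value, so the repeated runs reuse rows 1, 7 and 4 (row names 1.1, 7.1, 4.1); the result matches the desired output here only because y is constant.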
df[match(with(rle(df$x), values), df$x), ]
# x y
# 1 2 1
# 4 3 1
# 7 1 1
# 10 5 1
# 1.1 2 1
# 7.1 1 1
# 4.1 3 1
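A base R sketch of the same idea that also works when x is not numeric (diff needs numbers): compare each element with its predecessor and keep the rows where the value changes. The names run_start and starts are just illustrative choices.

```r
x <- c(2,2,2,3,3,3,1,1,1,5,5,5,2,2,2,1,1,1,3,3,3)
y <- rep(1, length(x))
df <- data.frame(x, y)

# TRUE on the first row and wherever x differs from the previous row
run_start <- c(TRUE, df$x[-1] != df$x[-nrow(df)])
first_of_run <- df[run_start, ]

# Equivalent: start positions computed from the run lengths
r <- rle(df$x)
starts <- cumsum(c(1, head(r$lengths, -1)))

print(first_of_run)
```

Indexing with df[starts, ] selects the same rows, and the original row names (1, 4, 7, ...) are preserved either way.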

How to get consecutive rank for multiple variables [duplicate]

This question already has answers here:
Create a ranking variable with dplyr?
(3 answers)
Closed 3 years ago.
I have a data set with 5 varieties (var) and 3 variables (x, y, z). I need to rank the varieties on each of the 3 variables. When there is a tie, the ranking leaves a gap before the next rank, so I cannot get consecutive ranks. Here is my data:
x<-c(3,3,4,5,5)
y<-c(5,6,4,4,5)
z<-c(2,3,4,3,5)
df<-cbind(x,y,z)
rownames(df) <- paste0("G", 1:nrow(df))
df <- data.frame(var = row.names(df), df)
I tried the following code for my result
res <- sapply(df, rank,ties.method='min')
res
var x y z
[1,] 1 1 3 1
[2,] 2 1 5 2
[3,] 3 3 1 4
[4,] 4 4 1 2
[5,] 5 4 3 5
For x I got ranks 1 1 3 4 4 instead of 1 1 2 3 3, and the same happens for y and z.
My desired result is
>res
var x y z
[1,] 1 1 2 1
[2,] 2 1 3 2
[3,] 3 2 1 3
[4,] 4 3 1 2
[5,] 5 3 2 4
I will be grateful if anyone helps me.
Well, an easy way would be to convert to factor and then integer
df[] <- lapply(df, function(x) as.integer(factor(x)))
df
# var x y z
#G1 1 1 2 1
#G2 2 1 3 2
#G3 3 2 1 3
#G4 4 3 1 2
#G5 5 3 2 4
One dplyr possibility could be:
library(dplyr)
df %>%
mutate_at(2:4, list(~ dense_rank(.)))
var x y z
1 G1 1 2 1
2 G2 1 3 2
3 G3 2 1 3
4 G4 3 1 2
5 G5 3 2 4
Or a base R possibility:
df[2:4] <- lapply(df[2:4], function(x) match(x, sort(unique(x))))
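We can use data.table; frank with ties.method = "dense" is its dense-ranking function (dense_rank belongs to dplyr)
library(data.table)
setDT(df)[, (2:4) := lapply(.SD, frank, ties.method = "dense"), .SDcols = 2:4]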
df
# var x y z
#1: G1 1 2 1
#2: G2 1 3 2
#3: G3 2 1 3
#4: G4 3 1 2
#5: G5 3 2 4
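For reference, a small base R sketch of the difference between the two rankings on the x column from the question. rank(ties.method = "min") skips ranks after a tie, while matching each value against the sorted distinct values gives consecutive ranks. The helper name dense is hypothetical.

```r
x <- c(3, 3, 4, 5, 5)

# Competition ranking: ties share the lowest rank, later ranks are skipped
min_rank <- rank(x, ties.method = "min")

# Dense ranking: position of each value among the sorted distinct values
dense <- function(v) match(v, sort(unique(v)))
dense_x <- dense(x)

print(min_rank)  # 1 1 3 4 4
print(dense_x)   # 1 1 2 3 3
```

Applied to y and z from the question, dense() gives 2 3 1 1 2 and 1 2 3 2 4, which is the desired result.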

Select rows of data frame based on a vector with duplicated values

What I want can be described as follows: given a data frame that contains all the case-control pairs, where y is the id of the case-control pair, resample whole pairs. In the following example there are 3 pairs. I'm resampling with respect to the values of y, so both members of a pair are selected together or not at all.
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
> sample_df
x y
1 1 1
2 2 1
3 3 2
4 4 2
5 5 3
6 6 3
select_y = c(1,3,3)
> select_y
[1] 1 3 3
Now, I have computed a vector, select_y above, that contains the pairs I want to resample. It means case-control pair number 1 will be in my new sample, and pair number 3 will also be in it, but twice, since select_y contains two 3s. The desired output is:
x y
1 1
2 1
5 3
6 3
5 3
6 3
I can't find out an efficient way other than writing a for loop...
Solution:
Based on @HubertL's answer, with some modifications, a 'vectorized' approach looks like:
sel_y <- as.data.frame(table(select_y))
> sel_y
select_y Freq
1 1 1
2 3 2
sub_sample_df = sample_df[sample_df$y%in%select_y,]
> sub_sample_df
x y
1 1 1
2 2 1
5 5 3
6 6 3
match_freq = sel_y[match(sub_sample_df$y, sel_y$select_y),]
> match_freq
select_y Freq
1 1 1
1.1 1 1
2 3 2
2.1 3 2
sub_sample_df$Freq = match_freq$Freq
rownames(sub_sample_df) = NULL
> sub_sample_df
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
4 6 3 2
selected_rows = rep(1:nrow(sub_sample_df), sub_sample_df$Freq)
> selected_rows
[1] 1 2 3 3 4 4
sub_sample_df[selected_rows,]
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
3.1 5 3 2
4 6 3 2
4.1 6 3 2
Another method of doing the same without a loop:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
row_names <- split(1:nrow(sample_df),sample_df$y)
select_y = c(1,3,3)
row_num <- unlist(row_names[as.character(select_y)])
ans <- sample_df[row_num,]
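To make the steps in the snippet above explicit, here is the same approach with the intermediate values spelled out. split groups the row numbers by pair id, indexing the resulting list by name repeats a pair as often as it appears in select_y, and unlist flattens the row numbers in that order. The names rows_by_pair and resampled are illustrative.

```r
sample_df <- data.frame(x = 1:6, y = c(1, 1, 2, 2, 3, 3))
select_y <- c(1, 3, 3)

# Row numbers per pair id: list(`1` = 1:2, `2` = 3:4, `3` = 5:6)
rows_by_pair <- split(seq_len(nrow(sample_df)), sample_df$y)

# Look up each requested pair by name; duplicates in select_y repeat the pair
row_num <- unlist(rows_by_pair[as.character(select_y)], use.names = FALSE)

resampled <- sample_df[row_num, ]
rownames(resampled) <- NULL  # drop the 5.1 / 6.1 style row names
print(resampled)
```

Because the lookup follows select_y element by element, the pairs come out in the order in which they were drawn.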
I can't find a way without a loop, but at least it's not a for loop, and there is only one iteration per frequency:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
select_y = c(1,3,3)
sel_y <- as.data.frame(table(select_y))
do.call(rbind,
lapply(1:max(sel_y$Freq),
function(freq) sample_df[sample_df$y %in%
sel_y[sel_y$Freq>=freq, "select_y"],]))
x y
1 1 1
2 2 1
5 5 3
6 6 3
51 5 3
61 6 3

R - How to create a variable based on another variable

I have:
v1 <- c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4)
and I want to create v2, which numbers each set of 3 elements within a run of v1:
v2 <- c(1,1,1,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2)
Explanation:
For the first three times a number is repeated the value corresponding to that number is a 1, for the second three times it's a 2, and so on.
v1 <- c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4)
Use rle to find the run lengths:
l <- rle(v1)$lengths
#[1] 3 3 9 6
Create a sequence 1:n for each run length n:
s <- sequence(l)
#[1] 1 2 3 1 2 3 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6
Use integer division:
(s - 1) %/% 3 + 1
#[1] 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2
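The three steps can be wrapped into one small function. This is a sketch; the name block_within_run and the size argument are illustrative, not part of the answer above.

```r
# Number consecutive blocks of `size` elements within each run of v
block_within_run <- function(v, size = 3) {
  run_lengths <- rle(v)$lengths      # length of each run of equal values
  position <- sequence(run_lengths)  # 1..n inside every run
  (position - 1) %/% size + 1        # block index inside the run
}

v1 <- c(1,1,1,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4)
v2 <- block_within_run(v1)
print(v2)
```

If a run's length is not a multiple of size, the leftover elements simply form a shorter final block with the next index.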

How can I check the occurrences of the values for each individual in R?

Let's say I have a data.frame that looks like this:
ID B
1 1
1 2
1 1
1 3
2 2
2 2
2 2
2 2
3 2
3 10
3 2
Now I want to count the occurrences of B under each ID: for ID 1, the value 1 occurs twice and 2 and 3 occur once each; for ID 2, only 2 occurs, 4 times. How should I accomplish this? I tried to use table in ddply, but somehow it did not work. Thanks.
It seems like you may just want a table
> table(dat)
## B
## ID 1 2 3 10
## 1 2 1 1 0
## 2 0 4 0 0
## 3 0 2 0 1
Then the following shows that for ID equal to 1, there are two 1s, one 2, and one 3.
> table(dat)[1, ]
## 1 2 3 10
## 2 1 1 0
And here's an aggregate solution:
> with(data, aggregate(B, list(ID=ID, B=B), length))
ID B x
1 1 1 2
2 1 2 1
3 2 2 4
4 3 2 2
5 1 3 1
6 3 10 1
Here's an approach using "dplyr" (if I understood your question correctly):
library(dplyr)
mydf %>% group_by(ID, B) %>% summarise(count = n())
# Source: local data frame [6 x 3]
# Groups: ID
#
# ID B count
# 1 1 1 2
# 2 1 2 1
# 3 1 3 1
# 4 2 2 4
# 5 3 2 2
# 6 3 10 1
In "plyr", I guess it would be something like:
library(plyr)
ddply(mydf, .(ID, B), summarise, count = length(B))
In base R, you could do something like the following and just remove the rows with 0:
data.frame(table(mydf))
# ID B Freq
# 1 1 1 2
# 2 2 1 0
# 3 3 1 0
# 4 1 2 1
# 5 2 2 4
# 6 3 2 2
# 7 1 3 1
# 8 2 3 0
# 9 3 3 0
# 10 1 10 0
# 11 2 10 0
# 12 3 10 1
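A minimal sketch of that last step, building the example data from the question (the name dat is an assumption, since the question does not name its data frame) and dropping the zero-count combinations:

```r
dat <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  B  = c(1, 2, 1, 3, 2, 2, 2, 2, 2, 10, 2)
)

# Cross-tabulate ID by B, then reshape to one row per combination
counts <- as.data.frame(table(dat), stringsAsFactors = FALSE)

# Keep only the combinations that actually occur
counts <- counts[counts$Freq > 0, ]
rownames(counts) <- NULL
print(counts)
```

With stringsAsFactors = FALSE the ID and B columns come back as character; as.numeric() converts them if numbers are needed downstream.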
And the data.table solution because there must be:
data[, .N, by=c('ID','B')]
The above won't work if you try to apply it to a data.frame. It must be converted to a data.table first. With more recent versions of "data.table", this is most easily done with setDT (as recommended by David in the comments):
library(data.table)
setDT(data)[, .N, by=c('ID', 'B')]
