There are different results between subset code in R [duplicate] - r

This question already has an answer here:
R subset with condition using %in% or ==. Which one should be used? [duplicate]
(1 answer)
Closed 2 years ago.
The results of:
BB = RB[RB$Rep %in% c("1", "3"), ] and
Bb = subset(RB, Rep == c("1", "3"))
are different.
Please tell me what the problem is?

When you use ==, the comparison is done element by element, in order.
Consider this example :
df <- data.frame(a = 1:6, b = c(1:3, 3:1))
df
# a b
#1 1 1
#2 2 2
#3 3 3
#4 4 3
#5 5 2
#6 6 1
When you use :
subset(df, b == c(1, 3))
# a b
#1 1 1
#4 4 3
The 1st value of b is compared with 1 and the 2nd with 3. Because c(1, 3) is shorter than b, its values are recycled: the 3rd value of b is again compared with 1, the 4th with 3, and so on until the end of the dataframe. Hence, you get rows 1 and 4 as output here.
When you use %in%, it checks whether each value of b is either 1 or 3, so it selects all the rows where b is 1 or 3.
subset(df, b %in% c(1, 3))
# a b
#1 1 1
#3 3 3
#4 4 3
#6 6 1
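To see the recycling directly, it can help to print the two logical masks side by side. This is a minimal sketch using the same df as above; the names eq_mask and in_mask are only for illustration.

```r
df <- data.frame(a = 1:6, b = c(1:3, 3:1))

# == recycles c(1, 3) to length 6, so b is compared with 1, 3, 1, 3, 1, 3
eq_mask <- df$b == c(1, 3)

# %in% tests each element of b for membership in the set {1, 3}
in_mask <- df$b %in% c(1, 3)

eq_mask  # TRUE FALSE FALSE  TRUE FALSE FALSE
in_mask  # TRUE FALSE  TRUE  TRUE FALSE  TRUE
```

If the length of the longer vector is not a multiple of the shorter one, == additionally emits a warning about the object lengths, which is often the first hint that %in% was intended.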

Related

count unique combinations of variable values in an R dataframe column [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Count number of rows within each group
(17 answers)
Closed 2 years ago.
I want to count the unique combinations of a variable that appear per group.
For example:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,4,4,4,5,6,6,7,7,7),
status = c("a","b","c","a","b","c","b","c","b","c","d","b","b","c","b","c", "d"))
> df
id status
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
7 3 b
8 3 c
9 4 b
10 4 c
11 4 d
12 5 b
13 6 b
14 6 c
15 7 b
16 7 c
17 7 d
So that, for example, I can tally how many times a given combination of "status" appears.
By hand, for example, I see that "a,b,c" appears twice total (id's 1 and 2).
These seem to be similar questions, but I couldn't work out how to apply them in R and would like a clearer explanation:
Counting unique combinations
Count of unique combinations despite order
The result I think I am looking for would be something like:
abc 2
bc 3
b 1
...
An option with tidyverse: group by 'id', paste the 'status' values together, and count the resulting strings
library(dplyr)
library(stringr)
df %>%
group_by(id) %>%
summarise(status = str_c(status, collapse="")) %>%
count(status)
# A tibble: 4 x 2
# status n
# <chr> <int>
#1 abc 2
#2 b 1
#3 bc 2
#4 bcd 2
Here is a base R option via aggregate
> aggregate(.~status,rev(aggregate(.~id,df,paste0,collapse = "")),length)
status id
1 abc 2
2 b 1
3 bc 2
4 bcd 2
You can also use the apply family of functions: tapply and lapply build one string per id, and table counts them.
tap <- tapply(df$status, df$id ,FUN= function(x) unique(x))
lap <- lapply(tap,FUN = function(x) paste0(x,collapse=""))
status <- unlist(lap)
df1 <- data.frame(table(status))
> df1
status Freq
1 abc 2
2 b 1
3 bc 2
4 bcd 2
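The approaches above paste the statuses in the order they appear within each id. If the order should not matter (so that "b,c" and "c,b" count as the same combination), one possible variation is to sort the statuses before pasting. This is a sketch under that assumption; the names combo and counts are only for illustration.

```r
df <- data.frame(
  id = c(1,1,1,2,2,2,3,3,4,4,4,5,6,6,7,7,7),
  status = c("a","b","c","a","b","c","b","c","b","c","d","b","b","c","b","c","d")
)

# Build one key per id: unique statuses, sorted, pasted into a single string
combo <- tapply(df$status, df$id,
                function(x) paste(sort(unique(x)), collapse = ""))

# Tally how often each key occurs across ids
counts <- table(combo)
counts
```

With the example data this gives the same four combinations as the answers above, because the statuses already appear in alphabetical order within each id.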

Repeat rows with a variable in r [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 3 years ago.
I have a data.frame with n rows and I would like to repeat these rows according to the values of another variable.
This is an example for a data.frame
df <- data.frame(a=1:3, b=letters[1:3])
df
a b
1 1 a
2 2 b
3 3 c
And this one is an example for a variable
df1 <- data.frame(x=1:3)
df1
x
1 1
2 2
3 3
In the next step I would like to repeat every row of df as many times as the corresponding value in df1
So that it would look like this
a b
1 1 a
2 2 b
3 2 b
4 3 c
5 3 c
6 3 c
If you have any idea how to solve this problem, I would be very thankful
You can simply repeat the row index:
df[rep(1:3,df1$x),]
# a b
#1 1 a
#2 2 b
#2.1 2 b
#3 3 c
#3.1 3 c
#3.2 3 c
or, without hard-coding the size 3:
df[rep(seq_along(df1$x),df1$x),]
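The repeated rows keep derived row names such as 2.1 and 3.1. If plain 1..6 numbering is wanted, as in the question's expected output, the row names can be reset afterwards. A minimal sketch, assuming df has three rows with b = a, b, c as in the printed example (out is a name chosen here for illustration):

```r
df  <- data.frame(a = 1:3, b = letters[1:3])
df1 <- data.frame(x = 1:3)

# Repeat row i of df exactly df1$x[i] times
out <- df[rep(seq_len(nrow(df)), times = df1$x), ]

# Drop the "2.1", "3.1", ... row names so rows are numbered 1..6
rownames(out) <- NULL
out
```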

Filling cell data with mean for each unique name [duplicate]

This question already has answers here:
replace NA with groups mean in a non specified number of columns [duplicate]
(2 answers)
Closed 3 years ago.
I have been using R for the past couple of days and I have a question that I am a little stumped on. I have a dataframe with bidder names and bids, where some of the bids are empty. I am having trouble implementing a dynamic way to take the average bid for each unique bidder and apply it to the empty cells. The line of code below takes the mean bid for each unique bidder. All I need to do is place the mean value from unique_bid in the empty cells that share the same bidder.
unique_bid <- aggregate(bid ~ bidder, auction[complete.cases(auction),], mean)
Here is a picture of what the dataframe looks like.
You could use ave.
Example:
df = data.frame(a = c(1,1,1,2,2,2), b=c(1,2,NA,4,5,NA),c= c(1,2,3,4,5,6))
> df
a b c
1 1 1 1
2 1 2 2
3 1 NA 3
4 2 4 4
5 2 5 5
6 2 NA 6
Do:
sel = is.na(df$b)
df$b[sel] = ave(df$b, df$a, FUN = function(x){mean(x, na.rm = TRUE)})[sel]
ave applies the function FUN to df$b within each group defined by df$a and returns a vector of the same length as df$b. sel then picks out the NA elements of df$b, which are replaced by the corresponding group means.
Result:
> df
a b c
1 1 1.0 1
2 1 2.0 2
3 1 1.5 3
4 2 4.0 4
5 2 5.0 5
6 2 4.5 6
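Applied to the question's column names, the same idea might look like the sketch below. The auction data here is made up for illustration, since the original dataframe was only shown as a picture; only the column names bidder and bid come from the question.

```r
# Hypothetical data standing in for the asker's `auction` dataframe
auction <- data.frame(
  bidder = c("ann", "ann", "ann", "bob", "bob", "bob"),
  bid    = c(10, 20, NA, 5, NA, 15)
)

# Per-row mean bid of that row's bidder, ignoring missing values
group_mean <- ave(auction$bid, auction$bidder,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Overwrite only the missing bids
missing <- is.na(auction$bid)
auction$bid[missing] <- group_mean[missing]
auction
```

Note that a bidder whose bids are all missing would get NaN here, because the mean of an empty vector is NaN.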

Construct dataframe with levels [duplicate]

This question already has answers here:
Unique combination of all elements from two (or more) vectors
(6 answers)
Closed 5 years ago.
I just migrated from Python to R and I would like to know if there is any function in R which is similar to pandas.MultiIndex.from_product?
Example:
letters <- c('a', 'b')
numbers <- c(1, 2, 3)
df <- somefunction(letters, numbers)
df
letters numbers
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
Yes:
> letters <- c('a', 'b')
> numbers <- c(1, 2, 3)
> expand.grid(letters=letters, numbers=numbers)
letters numbers
1 a 1
2 b 1
3 a 2
4 b 2
5 a 3
6 b 3
You can also use CJ from the data.table package, which is faster. Note that the result is a data.table rather than an ordinary dataframe:
> library(data.table)
> CJ(letters=letters, numbers=numbers)
letters numbers
1: a 1
2: a 2
3: a 3
4: b 1
5: b 2
6: b 3
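Note that expand.grid varies its first argument fastest, so its row order differs from the order shown in the question. If the question's order is needed in base R, one option is to pass the arguments in reverse and then reorder the columns. A sketch, with the vectors renamed to lets and nums (names chosen here) to avoid masking the built-in letters:

```r
lets <- c("a", "b")
nums <- c(1, 2, 3)

# numbers is passed first so that it varies fastest within each letter
grid <- expand.grid(numbers = nums, letters = lets, stringsAsFactors = FALSE)

# Put the columns back in the order used in the question
grid <- grid[, c("letters", "numbers")]
grid
```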

R - Using for loop to conditionally change values in a dataframe

All of the variables are on the same scale in the data.frame 1-5.
Example of data.frame
rpi_invert
A B C D
5 2 4 1
3 5 5 2
1 1 3 4
For all values that equal 5 I would like to change it to 1.
for 4 change to 2.
for 2 change to 4.
for 1 change to 5.
Example of data.frame after values have been changed.
rpi_invert
A B C D
1 4 2 5
3 1 1 4
5 5 3 2
What I have tried:
for(b in colnames(rpi_invert)){
rpi_invert[[b]][rpi_invert[[b]] == 5] <- 1
rpi_invert[[b]][rpi_invert[[b]] == 4] <- 2
rpi_invert[[b]][rpi_invert[[b]] == 2] <- 4
rpi_invert[[b]][rpi_invert[[b]] == 1] <- 5
}
This will only change the values in the first row and not the second column.
for(b in colnames(rpi_invert)){
rpi_invert <- ifelse(rpi_invert[[b]] == 5,1,
ifelse(rpi_invert[[b]] == 4,2,
ifelse(rpi_invert[[b]] == 2,4,
ifelse(rpi_invert[[b]] == 1,5,rpi_invert[[b]]))))
}
But this gives me the error:
Error in rpi_invert[[b]] : subscript out of bounds
If I try the same methods on an individual column instead of looping through the data.frame, both methods work, so I am not sure what the problem is.
I am sure what I am trying to do can be done more efficiently without a for loop, probably with some type of apply function, but I am not sure how.
Any help will be appreciated. Please let me know if further information is needed.
You can try (if your data.frame is df):
3-(df-3)
# A B C D
#1 1 4 2 5
#2 3 1 1 4
#3 5 5 3 2
or, same but written a bit differently: 6-df
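As for why the two loops in the question misbehave: in the first, the replacements run one after another, so a 5 that was just turned into 1 is turned back into 5 by the last line (and likewise 4 and 2). In the second, rpi_invert itself is overwritten with the vector returned by ifelse, so on the next iteration rpi_invert[[b]] no longer refers to a column. For mappings that are not a simple 6 - x, a lookup vector avoids both problems. A minimal sketch, where lookup is a name chosen here for illustration:

```r
rpi_invert <- data.frame(A = c(5, 3, 1), B = c(2, 5, 1),
                         C = c(4, 5, 3), D = c(1, 2, 4))

# lookup[i] is the new value for an old value of i
# (1 -> 5, 2 -> 4, 3 -> 3, 4 -> 2, 5 -> 1)
lookup <- c(5, 4, 3, 2, 1)

# Each column is recoded from its original values in a single step,
# so a replaced value is never touched a second time
rpi_invert[] <- lapply(rpi_invert, function(col) lookup[col])
rpi_invert
```

Assigning into rpi_invert[] keeps the object a data.frame with its column names, instead of replacing it with a list or vector.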
