Filling cell data with mean for each unique name [duplicate] - r

This question already has answers here:
replace NA with groups mean in a non specified number of columns [duplicate]
(2 answers)
Closed 3 years ago.
I have been using R for the past couple of days and I have a question that has me a little stumped. I have a data frame with bidder names and bids, where some of the bids are empty. I am having trouble finding a dynamic way to take the average bid for each unique bidder and apply it to that bidder's empty cells. The line of code below computes the mean bid for each unique bidder. All I need to do is place the mean value from unique_bid in the empty cells that share the same bidder.
unique_bid <- aggregate(bid ~ bidder, auction[complete.cases(auction),], mean)
Here is a picture of what the dataframe looks like.

You could use ave.
Example:
df = data.frame(a = c(1,1,1,2,2,2), b=c(1,2,NA,4,5,NA),c= c(1,2,3,4,5,6))
> df
a b c
1 1 1 1
2 1 2 2
3 1 NA 3
4 2 4 4
5 2 5 5
6 2 NA 6
Do:
sel = is.na(df$b)
df$b[sel] = ave(df$b, df$a, FUN = function(x){mean(x, na.rm = TRUE)})[sel]
ave applies the function FUN to df$b within each group defined by df$a and returns a vector of the same length as df$b. sel selects the NA elements of df$b, which are then replaced by the corresponding group means.
Result:
> df
a b c
1 1 1.0 1
2 1 2.0 2
3 1 1.5 3
4 2 4.0 4
5 2 5.0 5
6 2 4.5 6
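The same pattern can be sketched on a data frame shaped like the one in the question. The auction data below is made up for illustration (the original data was only shown as a picture); only the column names bidder and bid come from the question.

```r
# Hypothetical data: the bidder names and bid values are assumptions.
auction <- data.frame(
  bidder = c("ann", "ann", "bob", "bob", "bob"),
  bid    = c(10, NA, 4, NA, 8)
)

# Rows whose bid is missing
missing <- is.na(auction$bid)

# ave() returns one group mean per row; keep only the rows that were NA
group_mean <- ave(auction$bid, auction$bidder,
                  FUN = function(x) mean(x, na.rm = TRUE))
auction$bid[missing] <- group_mean[missing]

auction$bid
# [1] 10 10  4  6  8
```

A bidder whose bids are all missing would get NaN here, since there is nothing to average.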

Related

There are different results between subset code in R [duplicate]

This question already has an answer here:
R subset with condition using %in% or ==. Which one should be used? [duplicate]
(1 answer)
Closed 2 years ago.
The results of:
BB = RB[RB$Rep %in% c("1", "3"), ] and
Bb = subset(RB, Rep == c("1", "3"))
are different.
Please tell me what the problem is?
When you use ==, the comparison is done element by element, in order.
Consider this example :
df <- data.frame(a = 1:6, b = c(1:3, 3:1))
df
# a b
#1 1 1
#2 2 2
#3 3 3
#4 4 3
#5 5 2
#6 6 1
When you use :
subset(df, b == c(1, 3))
# a b
#1 1 1
#4 4 3
The 1st value of b is compared with 1 and the 2nd with 3. Because the comparison vector is shorter, its values are recycled: the 3rd value is compared with 1 again, the 4th with 3, and so on to the end of the data frame. Hence you get rows 1 and 4 as output here.
When you use %in%, it checks whether each value of b is either 1 or 3, so it selects all the rows where b is 1 or 3.
subset(df, b %in% c(1, 3))
# a b
#1 1 1
#3 3 3
#4 4 3
#6 6 1
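To see why the two calls differ, it can help to look at the logical vectors that subset() receives. This is a small sketch on the same df as above; the names eq and is_in are only for illustration.

```r
df <- data.frame(a = 1:6, b = c(1:3, 3:1))

# == recycles c(1, 3) to 1, 3, 1, 3, 1, 3 and compares position by position
eq <- df$b == c(1, 3)
eq
# [1]  TRUE FALSE FALSE  TRUE FALSE FALSE

# %in% asks, for each value of b, whether it appears anywhere in c(1, 3)
is_in <- df$b %in% c(1, 3)
is_in
# [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE
```

If the number of rows is not a multiple of the length of the comparison vector, == also emits a "longer object length is not a multiple of shorter object length" warning, which is often the first hint that %in% was intended.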

Replacing a specific numeric value within a Data-Frame with N/A [duplicate]

This question already has answers here:
Replace all 0 values to NA
(11 answers)
Closed 4 years ago.
I'm currently working on an updated version of a dataset that I have worked on previously in RStudio. The new version uses codes for missing values rather than leaving those cells blank.
The issue is that the codes are numeric values, which interfere with my analysis and modelling. In particular, the age column is also numeric, so the codes skew my models.
I am looking for a way to replace values that are specifically coded as missing (e.g. the code for a missing value is 9998) with NA within the data frame.
Something like this, perhaps?
> d <- data.frame(x = 1:5,y = letters[1:5],z = c(NA,1:4))
> d$x[3] <- 9998
> d
x y z
1 1 a NA
2 2 b 1
3 9998 c 2
4 4 d 3
5 5 e 4
> d[d == 9998] <- NA
> d
x y z
1 1 a NA
2 2 b 1
3 NA c 2
4 4 d 3
5 5 e 4
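Note that d[d == 9998] <- NA replaces the code in every column. If 9998 could be a legitimate value elsewhere, the replacement can be limited to one column. The sketch below assumes a hypothetical data frame with an age column and an id column; neither the data nor the column names come from the question.

```r
# Hypothetical data: 9998 is a missing-value code in age but a real id
d <- data.frame(age = c(34, 9998, 51, NA),
                id  = c(9998, 2, 3, 4))

# %in% never returns NA, so existing NAs in age do not disturb the index
d$age[d$age %in% 9998] <- NA

d$age
# [1] 34 NA 51 NA
d$id
# [1] 9998    2    3    4
```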

Apply a maximum value to whole group [duplicate]

This question already has answers here:
Aggregate a dataframe on a given column and display another column
(8 answers)
Closed 6 years ago.
I have a df like this:
Id count
1 0
1 5
1 7
2 5
2 10
3 2
3 5
3 4
and I want to get the maximum count and apply that to the whole "group" based on ID, like this:
Id count max_count
1 0 7
1 5 7
1 7 7
2 5 10
2 10 10
3 2 5
3 5 5
3 4 5
I've tried pmax, slice, etc. I'm generally having trouble working with data that is in interval-specific form; if you could direct me to tools well suited to that type of data, I would really appreciate it!
Figured it out with help from Gavin Simpson here: Aggregate a dataframe on a given column and display another column
maxcount <- aggregate(count ~ Id, data = df, FUN = max)
names(maxcount)[2] <- "max_count"  # rename so that merge joins on Id only
new_df <- merge(df, maxcount, by = "Id")
Better way:
df$max_count <- with(df, ave(count, Id, FUN = max))
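For reference, here is the ave approach run on the data from the question, as a self-contained sketch. ave returns one value per row, so the result can be assigned straight into a new column without a merge.

```r
df <- data.frame(Id    = c(1, 1, 1, 2, 2, 3, 3, 3),
                 count = c(0, 5, 7, 5, 10, 2, 5, 4))

# Group maximum, repeated for every row of the group
df$max_count <- with(df, ave(count, Id, FUN = max))

df$max_count
# [1]  7  7  7 10 10  5  5  5
```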

split a dataframe with numbers separated by the add sign '+' into new rows [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
Sorry for the naive question but I have a dataframe like this:
n sp cap
1 1 a 3
2 2 b 3+2+4
3 3 c 2
4 4 d 1+5
I need to split the numbers separated by the add sign ("+") into new rows in order to get a new dataframe like the one below:
n sp cap
1 1 a 3
2 2 b 3
3 2 b 2
4 2 b 4
5 3 c 2
6 4 d 1
7 4 d 5
How can I do that? strsplit?
thanks in advance
We could use cSplit from splitstackshape
library(splitstackshape)
cSplit(df1, 'cap', sep = "+", direction = "long")
# n sp cap
#1: 1 a 3
#2: 2 b 3
#3: 2 b 2
#4: 2 b 4
#5: 3 c 2
#6: 4 d 1
#7: 4 d 5
Or we could do this in base R. Use strsplit to split the elements of the "cap" column into substrings, which returns a list (lst). Replicate the rows of the dataset by the length of each list element and subset the dataset with that new index. Then convert the "lst" elements to numeric, unlist them, and cbind the result with the modified dataset.
lst <- strsplit(as.character(df1$cap), "[+]")
df2 <- cbind(df1[rep(1:nrow(df1), sapply(lst, length)),1:2],
cap= unlist(lapply(lst, as.numeric)))
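The base R steps above can be tried end to end with the data from the question. This is a sketch with small variations: it uses fixed = TRUE instead of the "[+]" pattern, lengths() instead of sapply(lst, length), and resets the row names at the end. df1 is rebuilt here by hand from the printed table.

```r
df1 <- data.frame(n   = 1:4,
                  sp  = c("a", "b", "c", "d"),
                  cap = c("3", "3+2+4", "2", "1+5"),
                  stringsAsFactors = FALSE)

# One character vector of pieces per original row
lst <- strsplit(df1$cap, "+", fixed = TRUE)

# Repeat each row once per piece, then attach the pieces as numbers
idx <- rep(seq_len(nrow(df1)), lengths(lst))
df2 <- cbind(df1[idx, c("n", "sp")], cap = as.numeric(unlist(lst)))
rownames(df2) <- NULL

df2$cap
# [1] 3 3 2 4 2 1 5
```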

Duplicating data frame rows by freq value in same data frame [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 7 years ago.
I have a data frame with names by type and their frequencies. I'd like to expand this data frame so that the names are repeated according to their name-type frequency.
For example, this:
> df = data.frame(name=c('a','b','c'),type=c(0,1,2),freq=c(2,3,2))
name type freq
1 a 0 2
2 b 1 3
3 c 2 2
would become this:
> df_exp
name type
1 a 0
2 a 0
3 b 1
4 b 1
5 b 1
6 c 2
7 c 2
Appreciate any suggestions on an easy way to do this.
You can just use rep to "expand" your data.frame rows:
df[rep(sequence(nrow(df)), df$freq), c("name", "type")]
# name type
# 1 a 0
# 1.1 a 0
# 2 b 1
# 2.1 b 1
# 2.2 b 1
# 3 c 2
# 3.1 c 2
And there's a function expandRows in the splitstackshape package that does exactly this. It also has the option to accept a vector specifying how many times to replicate each row, for example:
expandRows(df, "freq")
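As a self-contained base R sketch of the rep approach on the data from the question (the row-name reset at the end is an optional extra that gives plain 1 to 7 row names instead of 1, 1.1, 2, ...):

```r
df <- data.frame(name = c("a", "b", "c"),
                 type = c(0, 1, 2),
                 freq = c(2, 3, 2))

# Row i is repeated df$freq[i] times
df_exp <- df[rep(seq_len(nrow(df)), df$freq), c("name", "type")]
rownames(df_exp) <- NULL

nrow(df_exp)
# [1] 7
```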
