How can I operate on elements of a data.frame in R to create a new column? [duplicate]

This question already has answers here:
Idiomatic R code for partitioning a vector by an index and performing an operation on that partition
(3 answers)
Closed 7 years ago.
Suppose I have a data.frame, df.
a b d
1 2 4
1 2 5
1 2 6
2 1 5
2 3 6
2 1 1
I'd like to operate on it so that, for each group of rows sharing the same values of a and b, I compute the mean of d.
I found that using aggregate can do this,
aggregate(d ~ a + b, df, mean)
This gives me something reasonable
a b d
1 2 5
2 1 3
2 3 6
But I would ideally like to keep my original d column, and add a new column m, so that I get the original data.frame with a new column "m" that contains the averages like,
a b d m
1 2 4 5
1 2 5 5
1 2 6 5
2 1 5 3
2 3 6 6
2 1 1 3
Any ideas on how to do this "properly" in R?

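library(dplyr)
# Recreate the example data from the question
df <- read.table(text = "a b d
1 2 4
1 2 5
1 2 6
2 1 5
2 3 6
2 1 1
", header = TRUE)
# Group by a and b, then add the group mean of d as a new column m;
# mutate() keeps every original row, unlike aggregate()
df %>%
  group_by(a, b) %>%
  mutate(m = mean(d))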
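For comparison, here is a minimal base R sketch of the same idea without dplyr. It assumes the example data frame from the question and relies on `ave()`, whose default function is `mean`, to return one group mean per original row.

```r
# Example data from the question
df <- data.frame(a = c(1, 1, 1, 2, 2, 2),
                 b = c(2, 2, 2, 1, 3, 1),
                 d = c(4, 5, 6, 5, 6, 1))

# ave() applies FUN (mean by default) within each a/b combination
# and returns a vector aligned with the original rows.
df$m <- ave(df$d, df$a, df$b)
df
```

Because the result has the same length as the input, it can be assigned straight back as a column and the row order is unchanged.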

Related

R: Create duplicate rows based on a variable (dplyr preferred) [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 3 years ago.
I'd like to create a new list with duplicate entries based upon an existing list in R. I'm trying to use tidyverse as much as possible, so dplyr would be preferred.
Say I have a list of times where sales occurred:
df <- data.frame(time = c(0,1,2,3,4,5), sales = c(1,1,2,1,1,3))
> df
time sales
1 0 1
2 1 1
3 2 2
4 3 1
5 4 1
6 5 3
And I'd like instead to have a list with an entry for each sale:
ans <- data.frame(salesTime = c(0,1,2,2,3,4,5,5,5))
> ans
salesTime
1 0
2 1
3 2
4 2
5 3
6 4
7 5
8 5
9 5
I found an interesting example using dplyr here: Create duplicate rows based on conditions in R
But this will only allow me to create one new row when sales == n, and not create n new rows when sales == n.
Any help would be greatly appreciated.
A nice tidyr function for this is uncount():
df %>%
uncount(sales) %>%
rename(salesTime = time)
salesTime
1 0
2 1
3 2
3.1 2
4 3
5 4
6 5
6.1 5
6.2 5
data.frame(salesTime = rep(df$time, df$sales))
# salesTime
#1 0
#2 1
#3 2
#4 2
#5 3
#6 4
#7 5
#8 5
#9 5
If you like dplyr and pipes you can go for:
df %>% {data.frame(salesTime = rep(.$time, .$sales))}
df %>% rowwise %>% mutate(time=list(rep(time,sales))) %>% unnest
## A tibble: 9 x 2
# sales time
# <dbl> <dbl>
#1 1 0
#2 1 1
#3 2 2
#4 2 2
#5 1 3
#6 1 4
#7 3 5
#8 3 5
#9 3 5
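If the other columns should be kept as well, one possible base R sketch is to repeat the row indices rather than a single column. This assumes the `df` from the question; the names `idx` and `expanded` are only illustrative.

```r
df <- data.frame(time = c(0, 1, 2, 3, 4, 5), sales = c(1, 1, 2, 1, 1, 3))

# Repeat each row index as many times as that row's sales count
idx <- rep(seq_len(nrow(df)), times = df$sales)

# Index the data frame with the repeated row numbers, then reset row names
expanded <- df[idx, , drop = FALSE]
rownames(expanded) <- NULL
expanded
```

Rows with a sales count of zero would be dropped, since their index is repeated zero times.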

Split dataframe based on one column in r, with a non-fixed width column [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 5 years ago.
I have a problem that is an extension of a well-covered issue here on SE. I.e:
Split a column of a data frame to multiple columns
My data has a column with a string format, comma-separated, but of no fixed length.
data = data.frame(id = c(1,2,3), treatments = c("1,2,3", "2,3", "8,9,1,2,4"))
So I would like to have my dataframe eventually be in the proper tidy/long form of:
id treatments
1 1
1 2
1 3
...
3 1
3 2
3 4
Something like separate or strsplit doesn't seem to be the solution on its own. separate fails with warnings that various rows have too many values (NB id 3 has more values than id 1).
Thanks
You can use tidyr::separate_rows:
library(tidyr)
separate_rows(data, treatments)
# id treatments
#1 1 1
#2 1 2
#3 1 3
#4 2 2
#5 2 3
#6 3 8
#7 3 9
#8 3 1
#9 3 2
#10 3 4
Using dplyr and tidyr packages:
data %>%
separate(treatments, paste0("v", 1:5)) %>%
gather(var, treatments, -id) %>%
na.exclude %>%
select(id, treatments) %>%
arrange(id)
id treatments
1 1 1
2 1 2
3 1 3
4 2 2
5 2 3
6 3 8
7 3 9
8 3 1
9 3 2
10 3 4
You can also use unnest:
library(tidyverse)
data %>%
mutate(treatments = stringr::str_split(treatments, ",")) %>%
unnest(treatments)
id treatments
1 1 1
2 1 2
3 1 3
4 2 2
5 2 3
6 3 8
7 3 9
8 3 1
9 3 2
10 3 4
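For readers without the tidyverse, the same reshaping can be sketched in base R with `strsplit()`. This is an illustration on the question's data, not one of the original answers. The names `parts` and `long` are made up for the example, and converting to integer assumes every treatment is a whole number.

```r
data <- data.frame(id = c(1, 2, 3),
                   treatments = c("1,2,3", "2,3", "8,9,1,2,4"),
                   stringsAsFactors = FALSE)

# Split each string on commas; the result is a list with one vector per row
parts <- strsplit(data$treatments, ",", fixed = TRUE)

# Repeat each id once per treatment and flatten the list into one column
long <- data.frame(id = rep(data$id, lengths(parts)),
                   treatments = as.integer(unlist(parts)))
long
```

`lengths(parts)` gives the number of treatments per id, so rows of any length are handled without fixing the number of columns in advance.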

How to randomly choose only one row in each group [duplicate]

This question already has answers here:
from data table, randomly select one row per group
(4 answers)
Closed 6 years ago.
Say I have a dataframe as follows:
df <- data.frame(Region = c("A","A","A","B","B","C","D","D","D","D"),
Combo = c(1,2,3,1,2,1,1,2,3,4))
> df
Region Combo
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 C 1
7 D 1
8 D 2
9 D 3
10 D 4
What I would like to do, is for each Region (A,B,C,D) randomly choose only one of the possible combos for that region.
If the chosen combination were indicated by a binary variable, it would look something potentially like this:
Region Combo RandomlyChosen
1 A 1 1
2 A 2 0
3 A 3 0
4 B 1 0
5 B 2 1
6 C 1 1
7 D 1 0
8 D 2 0
9 D 3 1
10 D 4 0
I'm aware of the sample function, but just don't know how to choose only one combo within each region.
I regularly use data.table, so any solutions using that are welcome, though solutions not using data.table are equally welcome.
Thanks!
In plain R you can use sample() within tapply():
df$Chosen <- 0
df[-tapply(-seq_along(df$Region),df$Region, sample, size=1),]$Chosen <- 1
df
Region Combo Chosen
1 A 1 0
2 A 2 1
3 A 3 0
4 B 1 1
5 B 2 0
6 C 1 1
7 D 1 0
8 D 2 0
9 D 3 1
10 D 4 0
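Note the double negation: the row numbers are negated before sampling and the result is negated back. When a group has a single row, sample() would otherwise receive one positive number n and sample from 1:n instead of returning that row number.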
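An alternative sketch that avoids the single-element pitfall of `sample()` without negation is to sample a position with `sample.int()` and index into the group's row numbers. The names `rows_by_region` and `chosen` are illustrative, and the seed is set only so the example is reproducible.

```r
set.seed(42)  # assumption: fixed seed, only for reproducibility of the sketch

df <- data.frame(Region = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D"),
                 Combo = c(1, 2, 3, 1, 2, 1, 1, 2, 3, 4))

# Row numbers of df, grouped by Region
rows_by_region <- split(seq_len(nrow(df)), df$Region)

# Pick one position within each group; sample.int(n, 1) is safe when n == 1
chosen <- vapply(rows_by_region,
                 function(idx) idx[sample.int(length(idx), 1)],
                 integer(1))

# Flag the chosen rows with 1 and all others with 0
df$RandomlyChosen <- as.integer(seq_len(nrow(df)) %in% chosen)
df
```

Which rows are flagged depends on the seed, but each region always gets exactly one row.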

Creating new dataframe with missing value

I have a data frame structured like this:
time <- c(1,1,1,1,2,2)
group <- c('a','b','c','d','c','d')
number <- c(2,3,4,1,2,12)
df <- data.frame(time,group,number)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 c 2
6 2 d 12
In order to plot the data, I need it to contain a value for each group (a to d) at each time point, even if that value is zero. So the data frame should look like this:
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a 0
6 2 b 0
7 2 c 2
8 2 d 12
Any help?
You can use expand.grid and merge, like this:
> merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all = TRUE)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a NA
6 2 b NA
7 2 c 2
8 2 d 12
From there, it's just a simple matter of replacing NA with 0.
new <- merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all.y = TRUE)
new[is.na(new$number),"number"] <- 0
new

How to add index of a List item after melt() in R [duplicate]

This question already has answers here:
Create counter with multiple variables [duplicate]
(6 answers)
Closed 7 years ago.
I am working with a list as follows:
> l <- list(c(2,4,9), c(4,2,6,1))
> m <- melt(l)
> m
value L1
2 1
4 1
9 1
4 2
2 2
6 2
1 2
I want to add an index i so that my resulting data frame m looks like this:
> m
i value L1
1 2 1
2 4 1
3 9 1
1 4 2
2 2 2
3 6 2
4 1 2
Here i indicates that 3 values belong to the first list element and 4 values belong to the second list element.
How can I achieve this? Can anyone help?
Just for completeness, some other options
data.table (which is basically what getanID is doing)
library(data.table)
setDT(m)[, i := seq_len(.N), L1]
dplyr
library(dplyr)
m %>%
group_by(L1) %>%
mutate(i = row_number())
Base R (from comments by @user20650)
transform(m, i = ave(L1, L1, FUN = seq_along))
You could use splitstackshape
library(splitstackshape)
getanID(m, 'L1')[]
# value L1 .id
#1: 2 1 1
#2: 4 1 2
#3: 9 1 3
#4: 4 2 1
#5: 2 2 2
#6: 6 2 3
#7: 1 2 4
Or using base R
transform(stack(setNames(l, seq_along(l))), .id= rapply(l, seq_along))
Less elegant than ave but does the work:
transform(m, i = unlist(lapply(rle(m$L1)$lengths, seq_len)))
# value L1 i
#1 2 1 1
#2 4 1 2
#3 9 1 3
#4 4 2 1
#5 2 2 2
#6 6 2 3
#7 1 2 4
Or
m$i <- sequence(rle(m$L1)$lengths)
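As a self-contained illustration of the `sequence()` idea, the sketch below builds a data frame with the same shape as the melted `m` directly from the list, so it does not need `melt()`. Constructing `m` this way is an assumption made to keep the example dependency-free.

```r
l <- list(c(2, 4, 9), c(4, 2, 6, 1))

# Build a data frame shaped like the melted result: one row per value,
# with L1 recording which list element the value came from
m <- data.frame(value = unlist(l),
                L1 = rep(seq_along(l), lengths(l)))

# sequence() concatenates 1:n for each n, giving a within-element counter
m$i <- sequence(lengths(l))
m
```

Unlike the `rle()` variants, using `lengths(l)` does not depend on equal `L1` values being adjacent in `m`, but it does assume the rows are still in the original list order.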
