I am trying to spread() a couple of key/value pairs but the common value column does not collapse. I think that it may have to do with some previous processing, or more likely I do not know the right way to spread two or more key/value pairs to get the result I expect.
I'm starting with this data set:
library(tidyverse)
df <- tibble(order = 1:7,
line_1 = c(23,8,21,45,68,31,24),
line_2 = c(63,25,25,24,48,24,63),
line_3 = c(62,12,10,56,67,25,35))
There are 2 pre-spread steps to define order of the "count" values created in the following gather() function. This is the first pre-spread step to define the original order of the "count" variable using the row number:
ntrl <- df %>%
gather(line_1,
line_2,
line_3,
key = "sector",
value = "count") %>%
group_by(order) %>%
mutate(sector_ord = row_number()) %>%
arrange(order,
sector)
This is the second pre-spread step to define the numerical order of the "count" variable:
ord <- ntrl %>%
arrange(order,
count) %>%
group_by(order) %>%
mutate(num_ord = paste0("ord_",
row_number(),
sep=""))
And then finally the spread code that I have been using:
wide <- ord %>%
group_by(order) %>%
spread(key = sector,
value = count) %>%
spread(key = num_ord,
value = sector_ord)
What I'm getting is this:
order line_1 line_2 line_3 ord_1 ord_2 ord_3
1 1 23 NA NA 1 NA NA
2 1 NA 63 NA NA NA 2
3 1 NA NA 62 NA 3 NA
4 2 8 NA NA 1 NA NA
5 2 NA 25 NA NA NA 2
6 2 NA NA 12 NA 3 NA
7 3 21 NA NA NA 1 NA
8 3 NA 25 NA NA NA 2
9 3 NA NA 10 3 NA NA
... and so on thru 21 lines accounting for all 7 "order" lines
The behavior that I am expecting is that the "order" column would collapse in all rows that are the same "order" value to give the following:
order line_1 line_2 line_3 ord_1 ord_2 ord_3
1 1 23 63 62 1 3 2
2 2 8 25 12 1 3 2
3 3 21 25 10 2 3 1
4 4 45 24 56 2 1 3
... and so on, I think that paints the picture
I have reviewed the questions and answers about spreading with duplicate identifiers and the use of the index of row numbers but that does not help.
I figure that it has something to do with the double spreading, but I cannot figure out how to do that.
Thanks for your help.
A solution using tidyverse starting your df. The key is to use summarise_all(funs(.[which(!is.na(.))])) to select the only non-NA value for each column.
library(tidyverse)
df2 <- df %>%
gather(Lines, Value, -order) %>%
group_by(order) %>%
mutate(Rank = dense_rank(Value),
RankOrder = paste0("ord_", row_number())) %>%
spread(Lines, Value) %>%
spread(RankOrder, Rank) %>%
summarise_all(funs(.[which(!is.na(.))]))
df2
# A tibble: 7 x 7
order line_1 line_2 line_3 ord_1 ord_2 ord_3
<int> <dbl> <dbl> <dbl> <int> <int> <int>
1 1 23 63 62 1 3 2
2 2 8 25 12 1 3 2
3 3 21 25 10 2 3 1
4 4 45 24 56 2 1 3
5 5 68 48 67 3 1 2
6 6 31 24 25 3 1 2
7 7 24 63 35 1 3 2
Starting from df:
df %>%
gather(headers, line, -order) %>%
separate(headers, into = c('dummy', 'rn')) %>%
select(-dummy) %>%
group_by(order) %>%
mutate(ord = rank(line, ties.method='first')) %>%
{data.table::dcast(setDT(.), order ~ rn, value.var = c("line", "ord"))}
# order line_1 line_2 line_3 ord_1 ord_2 ord_3
#1: 1 23 63 62 1 3 2
#2: 2 8 25 12 1 3 2
#3: 3 21 25 10 2 3 1
#4: 4 45 24 56 2 1 3
#5: 5 68 48 67 3 1 2
#6: 6 31 24 25 3 1 2
#7: 7 24 63 35 1 3 2
Related
I have a data frame as follows:
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6
4 4 25 7
5 5 3 7
6 6 34 7
For each row with the sample_id, I would like to add a unique identifier as follows:
index val sample_id
1 1 14 5
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
Any suggestion? Thank you for your help.
Base R
dat$id2 <- ave(dat$sample_id, dat$sample_id,
FUN = function(z) if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else as.character(z))
dat
# index val sample_id id2
# 1 1 14 5 5
# 2 2 22 6 6-A
# 3 3 1 6 6-B
# 4 4 25 7 7-A
# 5 5 3 7 7-B
# 6 6 34 7 7-C
tidyverse
library(dplyr)
dat %>%
group_by(sample_id) %>%
mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-") else as.character(sample_id)) %>%
ungroup()
Minor note: it might be tempting to drop the as.character(z) from either or both of the code blocks. In the first, nothing will change (here): base R allows you to be a little sloppy; if we rely on that and need the new field to always be character, then in that one rare circumstance where all rows have unique sample_id, then the column will remain integer. dplyr is much more careful in guarding against this; if you run the tidyverse code without as.character, you'll see the error.
Using dplyr:
library(dplyr)
dplyr::group_by(df, sample_id) %>%
dplyr::mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-"))
index val sample_id
<int> <dbl> <chr>
1 1 14 5-A
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
If you just want to create unique tags for the same sample_id, maybe you can try make.unique like below
transform(
df,
sample_id = ave(as.character(sample_id),sample_id,FUN = function(x) make.unique(x,sep = "_"))
)
which gives
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6_1
4 4 25 7
5 5 3 7_1
6 6 34 7_2
I am trying to convert from long to wide format but multiple columns denote the unique rows.
In the example below, the block, density, species columns denote the unique individuals. Every individual has 2 or 3 rows associated with area and size. I want to convert the area and size to wide format.
This is my dataset
block <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
species <- c("A","A","A","A","B","B","B","B","A","A","A","A","B","B","B","B","B")
den <- c("20","20","50","50","20","20","50","50","20","20","50","50","20","20","50","50","50")
block <- as.factor(block)
den <- as.factor(den)
species <- as.factor(species)
area <- c(1:17)
size <- c(17:33)
df <- data.frame(block, species, den, area, size)
I want to final dataset with only the unique individuals
block species den area.1 area.2 area.3 size.1 size.2 size.3
1 A 20 1 2 NA 17 18 NA
1 A 50 3 4 NA 19 20 NA
.....
2 B 50 15 16 17 31 32 33
Note: Other answers that I have persued do not use multiple columns to denote the uniquness of the rows
We can use pivot_wider after creating a sequence column by group
library(dplyr)
library(tidyr)
df %>%
group_by(block, species, den) %>%
mutate(rn = row_number()) %>%
ungroup %>%
pivot_wider(names_from = rn, values_from = c(area, size), names_sep = ".")
# A tibble: 8 x 9
# block species den area.1 area.2 area.3 size.1 size.2 size.3
# <fct> <fct> <fct> <int> <int> <int> <int> <int> <int>
#1 1 A 20 1 2 NA 17 18 NA
#2 1 A 50 3 4 NA 19 20 NA
#3 1 B 20 5 6 NA 21 22 NA
#4 1 B 50 7 8 NA 23 24 NA
#5 2 A 20 9 10 NA 25 26 NA
#6 2 A 50 11 12 NA 27 28 NA
#7 2 B 20 13 14 NA 29 30 NA
#8 2 B 50 15 16 17 31 32 33
I am monitoring an animal population. I have their individual IDs as numbers, the date they were encountered on, and the number of individuals encountered on that day. I want to sum up the total number of different individuals encountered as the days go by, so I need it to recognize same IDs and only add new individuals to the total encountered.
This is my dataset, the last column being my desired outcome:
Month Day ID N. individuals that day Total encountered
5 13 44 3 3
5 13 58 3 3
5 13 57 3 3
5 14 58 1 3
5 15 44 2 4
5 15 06 2 4
Edit - updated to working, but inelegant, solution. The process here was to use padr to create a row for every ID in every date, with a 1 once it appears. Then we can count how many IDs have appeared as of each date, and add that to the original with a join.
library(tidyverse); library(lubridate)
# First, make a date column for easier sorting etc.
df1 <- df %>%
mutate(date = ymd(paste(2019, Month, Day))) %>%
select(date, ID) %>%
mutate(appearance = 1) # For counting later; if missing = NA in padded version
df2 <- df1 %>%
padr::pad(group = "ID", start_val = min(df1$date), end_val = max(df1$dat)) %>%
fill(appearance) %>%
count(date, Month = month(date), Day = day(date),
wt = appearance, name = "Total_encountered_calc")
df %>%
left_join(df2)
Output
Month Day ID N_individuals_that_day Total_encountered date Total_encountered_calc
1 5 13 44 3 3 2019-05-13 3
2 5 13 58 3 3 2019-05-13 3
3 5 13 57 3 3 2019-05-13 3
4 5 14 58 1 3 2019-05-14 3
5 5 15 44 2 4 2019-05-15 4
6 5 15 6 2 4 2019-05-15 4
An option
library(tidyverse)
df %>%
add_count(Month, Day) %>%
mutate(n1 = duplicated(ID)) %>%
group_by(Month, Day) %>%
mutate(n1 = c(min(n - n1), rep(0, n()-1))) %>%
ungroup %>%
mutate(n1 = cumsum(n1))
# A tibble: 6 x 5
# Month Day ID n n1
# <int> <int> <int> <int> <dbl>
#1 5 13 44 3 3
#2 5 13 58 3 3
#3 5 13 57 3 3
#4 5 14 58 1 3
#5 5 15 44 2 4
#6 5 15 6 2 4
Ciao,
Here is a replicate able example.
df <- data.frame("STUDENT"=c(1,2,3,4,5),
"TEST1A"=c(NA,5,5,6,7),
"TEST2A"=c(NA,8,4,6,9),
"TEST3A"=c(NA,10,5,4,6),
"TEST1B"=c(5,6,7,4,1),
"TEST2B"=c(10,10,9,3,1),
"TEST3B"=c(0,5,6,9,NA),
"TEST1TOTAL"=c(NA,23,14,16,22),
"TEST2TOTAL"=c(10,16,15,12,NA))
I have columns STUDENT through TEST3B and want to create TEST1TOTAL TEST2TOTAL. TEST1TOTAL=TEST1A+TEST2A+TEST3A and so on for TEST2TOTAL. If there is any missing score in TEST1A TEST2A TEST3A then TEST1TOTAL is NA.
here is my attempt but is there a solution with less lines of coding? Because here I will need to write this line out many times as there are up to TEST A through O.
TEST1TOTAL=rowSums(df[,c('TEST1A', 'TEST2A', 'TEST3A')], na.rm=TRUE)
Using just R base functions:
output <- data.frame(df1, do.call(cbind, lapply(c("A$", "B$"), function(x) rowSums(df1[, grep(x, names(df1))]))))
Customizing colnames:
> colnames(output)[(ncol(output)-1):ncol(output)] <- c("TEST1TOTAL", "TEST2TOTAL")
> output
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL
1 1 NA NA NA 5 10 0 NA 15
2 2 5 8 10 6 10 5 23 21
3 3 5 4 5 7 9 6 14 22
4 4 6 6 4 4 3 9 16 16
5 5 7 9 6 1 1 NA 22 NA
Try:
library(dplyr)
df %>%
mutate(TEST1TOTAL = TEST1A+TEST2A+TEST3A,
TEST2TOTAL = TEST1B+TEST2B+TEST3B)
or
df %>%
mutate(TEST1TOTAL = rowSums(select(df, ends_with("A"))),
TEST2TOTAL = rowSums(select(df, ends_with("B"))))
I think for what you want, Jilber Urbina's solution is the way to go. For completeness sake (and because I learned something figuring it out) here's a tidyverse way to get the score totals by test number for any number of tests.
The advantage is you don't need to specify the identifiers for the tests (beyond that they're numbered or have a trailing letter) and the same code will work for any number of tests.
library(tidyverse)
df_totals <- df %>%
gather(test, score, -STUDENT) %>% # Convert from wide to long format
mutate(test_num = paste0('TEST', ('[^0-9]', '', test),
'TOTAL'), # Extract test_number from variable
test_let = gsub('TEST[0-9]*', '', test)) %>% # Extract test_letter (optional)
group_by(STUDENT, test_num) %>% # group by student + test
summarize(score_tot = sum(score)) %>% # Sum score by student/test
spread(test_num, score_tot) # Spread back to wide format
df_totals
# A tibble: 5 x 4
# Groups: STUDENT [5]
STUDENT TEST1TOTAL TEST2TOTAL TEST3TOTAL
<dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA
2 2 11 18 15
3 3 12 13 11
4 4 10 9 13
5 5 8 10 NA
If you want the individual scores too, just join the totals together with the original:
left_join(df, df_totals, by = 'STUDENT')
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL TEST3TOTAL
1 1 NA NA NA 5 10 0 NA NA NA
2 2 5 8 10 6 10 5 11 18 15
3 3 5 4 5 7 9 6 12 13 11
4 4 6 6 4 4 3 9 10 9 13
5 5 7 9 6 1 1 NA 8 10 NA
I have two dataframes of this format.
df1:
id x y
1 2 3
2 4 5
3 6 7
4 8 9
5 1 1
df2:
id id2 v v2
1 t 11 21
1 b 12 22
2 t 13 23
2 b 14 24
3 t 15 25
3 b 16 26
4 b 17 27
Hence, sometimes, the id in main 'df' will appear twice (maximum) sometimes once, and sometimes not at all. The expected result would be:
df_merged:
id x y v.t v2.t v.b v2.b
1 2 3 11 21 12 22
2 4 5 13 23 24 24
3 6 7 15 25 16 26
4 8 9 NA NA 17 27
5 1 1 NA NA NA NA
I have used merge but due to the fact that id2 in df2 doesn't match, I get two instances of id in df_merged like so:
id x y v v2
1 ...
1 ...
Thanks in advance!
We can start by adjusting df2 to the right format then do a normal joining.
librar(dplyr)
library(tidyr)
df2 %>% gather(key,val,-id,-id2) %>% #Transfer from wide to long format for v and v2
mutate(new_key=paste0(key,'.',id2)) %>% #Create a new id2 as new_key
select(-id2,-key) %>% #de-select the unnessary columns
spread(new_key,val) %>% #Transfer back to wide foramt with right foramt for id
right_join(df1) %>% #right join df1 "To includes all rows in df1" using id
select(id,x,y,v.t,v2.t,v.b,v2.b) #rearrange columns name
Joining, by = "id"
id x y v.t v2.t v.b v2.b
1 1 2 3 11 21 12 22
2 2 4 5 13 23 14 24
3 3 6 7 15 25 16 26
4 4 8 9 NA NA 17 27
5 5 1 1 NA NA NA NA
You can solve this just using merge. Split df2 based on whether id2 equals b or t. Merge these two new objects with df1, and finally merge them together. The code includes one additional step to also include data found in df1 but not df2.
dfb <- merge(df1, df2[df2$id2=='b',], by='id')
dft <- merge(df1, df2[df2$id2=='t',], by='id')
dfRest <- df1[!df1$id %in% df2$id,]
dfAll <- merge(dfb[,c('id','x','y','v','v2')], dft[,c('id','v','v2')], by='id', all.x=T)
merge(dfAll, dfRest, all.x=T, all.y=T)
id x y v.x v2.x v.y v2.y
1 1 2 3 12 22 11 21
2 2 4 5 14 24 13 23
3 3 6 7 16 26 15 25
4 4 8 9 17 27 NA NA
5 5 1 1 NA NA NA NA