I'm new to programming in R and I have the following dataframe:
A B C D E
1 3 0 4 5 0
2 0 0 5 1 0
3 2 1 2 0 3
I would like to get a new dataframe containing the column indices of the n largest values in each row. For example, if I wanted the column indices of the 3 biggest values in each row (n = 3), the new dataframe should look like this:
F G H
1 1 3 4
2 1 3 4
3 1 3 5
So the first row of this dataframe contains the column indices of the 3 biggest values of row 1 in the original dataframe, and so on.
My original idea was to write a loop with which.max, but that seems far too long and inefficient. Does anyone have a better idea?
We can use apply
t(apply(df1, 1, function(x) sort(head(seq_along(x)[order(-x)], 3))))
# [,1] [,2] [,3]
#1 1 3 4
#2 1 3 4
#3 1 3 5
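To make n a parameter rather than a hard-coded 3, the apply call above can be wrapped in a small helper. This is only a sketch: top_n_cols and its arguments are hypothetical names, not from the answer, and it assumes n is no larger than the number of columns. Ties are broken by column order, as order() does.

```r
# Sample data from the question
df1 <- data.frame(A = c(3L, 0L, 2L), B = c(0L, 0L, 1L), C = c(4L, 5L, 2L),
                  D = c(5L, 1L, 0L), E = c(0L, 0L, 3L))

# top_n_cols() is a hypothetical helper name (not part of any package).
# Assumes n <= ncol(dat).
top_n_cols <- function(dat, n) {
  # For each row: order(-x) ranks columns by decreasing value, head() keeps
  # the first n positions, sort() puts those positions in ascending order
  idx <- apply(dat, 1, function(x) sort(head(order(-x), n)))
  # apply() returns a plain vector when n == 1, so rebuild the n-row matrix
  # before transposing to one row per original row
  t(matrix(idx, nrow = n))
}

res <- top_n_cols(df1, 3)
res
```

Calling top_n_cols(df1, 1) returns a one-column matrix holding the position of each row's maximum.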
Or using tidyverse
library(dplyr)
library(tidyr)
df1 %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn) %>%
group_by(rn) %>%
mutate(ind = row_number()) %>%
arrange(rn, desc(value)) %>%
slice(1:3) %>%
select(-name, -value) %>%
arrange(rn, ind) %>%
mutate(nm1 = c("F", "G", "H")) %>%
ungroup %>%
pivot_wider(names_from = nm1, values_from = ind)
data
df1 <- structure(list(A = c(3L, 0L, 2L), B = c(0L, 0L, 1L), C = c(4L,
5L, 2L), D = c(5L, 1L, 0L), E = c(0L, 0L, 3L)), class = "data.frame",
row.names = c("1",
"2", "3"))
Related
I am working with a dataframe with thousands of responses to questions about interest in a set of resources. I want to summarize how many participants are interested in a given resource by counting the number of positive responses (coded as "1").
As a final step, I would like to suppress any answer where fewer than 5 participants responded.
I've created code that works, but it's clunky when I'm dealing with dozens of fields. So, I'm looking for suggestions for a more streamlined approach, perhaps using piping or dplyr?
Example Input
ID Resource1 Resource2 Resource3 Resource4
1  1         0         1         1
2  0         0         0         1
3  1         0         0         0
4  0         0         0         0
5  1         1         1         1
Desired output
           Interested  Not Interested
Resource1  3           2
Resource2  1           4
Resource3  2           3
Resource4  3           2
My (ugly) code
###Select and summarise relevant columns
resource1 <- df %>% drop_na(resource1) %>% group_by(resource1) %>% summarise(n=n()) %>% rename(resp=resource1, r1 =n)
resource2 <- df %>% drop_na(resource2) %>% group_by(resource2) %>% summarise(n=n()) %>% rename(resp=resource2, r2 =n)
resource3 <- df %>% drop_na(resource3) %>% group_by(resource3) %>% summarise(n=n()) %>% rename(resp=resource3, r3 =n)
resource4 <- df %>% drop_na(resource4) %>% group_by(resource4) %>% summarise(n=n()) %>% rename(resp=resource4, r4 =n)
###Merge summarised data
resource_sum <-join_all(list(resource1,resource2,resource3,resource4), by=c("resp"))
###Replace all values less than 5 with NA per suppression rules.
resource_sum <- apply(resource_sum, 2, function(x) ifelse(x<5, "NA", x))
resource_sum <-as.data.frame(resource_sum)
We may reshape to 'long' format with pivot_longer, then group by the column name and summarise to get the counts of 1s and 0s
library(dplyr)
library(tidyr)
library(tibble)
df %>%
pivot_longer(cols = -ID) %>%
group_by(name) %>%
summarise(Interested = sum(value), NotInterested = n() - Interested) %>%
column_to_rownames('name')
-output
Interested NotInterested
Resource1 3 2
Resource2 1 4
Resource3 2 3
Resource4 3 2
Or using base R
v1 <- colSums(df[-1])
cbind(Interested = v1, NotInterested = nrow(df) - v1)
-output
Interested NotInterested
Resource1 3 2
Resource2 1 4
Resource3 2 3
Resource4 3 2
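Neither approach above applies the suppression rule from the question. One way to add it, sketched here under the assumption that suppressed cells should become NA, is to blank every count below a threshold. min_n is a hypothetical name; it is set to 2 here only so the small sample shows both outcomes, whereas the question's rule would use 5.

```r
# Sample data from the answer
df <- data.frame(ID = 1:5,
                 Resource1 = c(1L, 0L, 1L, 0L, 1L),
                 Resource2 = c(0L, 0L, 0L, 0L, 1L),
                 Resource3 = c(1L, 0L, 0L, 0L, 1L),
                 Resource4 = c(1L, 1L, 0L, 0L, 1L))

# Counts per resource, as in the base R answer
v1 <- colSums(df[-1])
counts <- cbind(Interested = v1, NotInterested = nrow(df) - v1)

# min_n is a hypothetical threshold name; the question's rule uses 5
min_n <- 2

# Replace every count below the threshold with NA (assumed suppression marker)
suppressed <- counts
suppressed[suppressed < min_n] <- NA
suppressed
```

With min_n set to 5, every cell of this five-row sample would be suppressed, which is why a smaller value is used for the demonstration.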
data
df <- structure(list(ID = 1:5, Resource1 = c(1L, 0L, 1L, 0L, 1L),
Resource2 = c(0L,
0L, 0L, 0L, 1L), Resource3 = c(1L, 0L, 0L, 0L, 1L), Resource4 = c(1L,
1L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -5L
))
You can use table to get the counts of 0 and 1 values. To apply the function (table) to multiple columns, you can use sapply:
t(sapply(df[-1], table))
# 0 1
#Resource1 2 3
#Resource2 4 1
#Resource3 3 2
#Resource4 2 3
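One caveat, offered as a hedged aside: if a column contains only 0s or only 1s, table returns a single count for it and sapply can no longer simplify the results to a matrix. Fixing the levels with factor keeps both counts for every column. The toy data frame below is made up for illustration.

```r
# Made-up data: R2 has no positive responses at all
df <- data.frame(ID = 1:3, R1 = c(1L, 0L, 1L), R2 = c(0L, 0L, 0L))

# factor(levels = 0:1) forces table() to report both 0 and 1 for each column,
# so every column yields a length-2 result and sapply() returns a matrix
res <- t(sapply(df[-1], function(x) table(factor(x, levels = 0:1))))
res
```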
I want to merge two dataframes so that, for the columns they have in common, the values from one are added to the existing values of the other.
For example:
df1
No A B C D
1  1 0 1 0
2  0 1 2 1
3  0 0 1 0
df2
No A B E F
1  1 0 1 1
2  0 1 2 1
3  2 1 1 0
Finally, I want the output table like this.
df
No A B C D E F
1  2 0 1 0 1 1
2  0 2 2 1 2 1
3  2 1 1 0 1 0
Note: I did try merge(), but in this case, it did not work.
Any help/suggestion would be appreciated.
Reproducible sample data
df1 <-
structure(list(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
C = c(1L, 2L, 1L), D = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <-
structure(list(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
E = c(1L, 2L, 1L), F = c(1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
You can also carry out this operation by left_joining these two data frames:
library(dplyr)
library(stringr)
df1 %>%
left_join(df2, by = "No") %>%
mutate(across(ends_with(".x"), ~ .x + get(str_replace(cur_column(), "\\.x", "\\.y")))) %>%
rename_with(~ str_replace(., "\\.x", ""), ends_with(".x")) %>%
select(!ends_with(".y"))
No A B C D E F
1 1 2 0 1 0 1 1
2 2 0 2 2 1 2 1
3 3 2 1 1 0 1 0
You can first row-bind the two dataframes and then compute the sum of each column while 'grouping' by the No column. This can be done like so:
library(dplyr)
bind_rows(df1, df2) %>%
group_by(No) %>%
summarise(across(c(A, B, C, D, E, `F`), ~ sum(.x, na.rm = TRUE)),
.groups = "drop")
If a particular column doesn't exist in one dataframe (i.e. columns E and F), values will be padded with NA. Adding the na.rm = TRUE argument (to be passed to sum()) means that these values will get treated like zeros.
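The pad-then-sum idea can also be seen without any packages. The sketch below is a base R analogue under stated assumptions, not the answer's code: pad is a hypothetical helper that adds the missing columns as NA, and rowsum then sums by No with na.rm = TRUE.

```r
# Sample data from the question
df1 <- data.frame(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
                  C = c(1L, 2L, 1L), D = c(0L, 1L, 0L))
df2 <- data.frame(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
                  E = c(1L, 2L, 1L), F = c(1L, 1L, 0L))

# All column names that appear in either data frame
all_cols <- union(names(df1), names(df2))

# pad() is a hypothetical helper: add absent columns as NA, then
# put the columns in a common order so rbind() lines them up
pad <- function(d) {
  d[setdiff(all_cols, names(d))] <- NA_integer_
  d[all_cols]
}

stacked <- rbind(pad(df1), pad(df2))

# Sum every value column within each No; na.rm = TRUE makes the
# NA padding behave like zero
res <- rowsum(stacked[-1], group = stacked$No, na.rm = TRUE)
res
```

The groups end up as row names of res rather than as a No column, which is a difference from the dplyr result.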
Using data.table :
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)[, lapply(.SD, sum, na.rm = TRUE), No]
# No A B C D E F
#1: 1 2 0 1 0 1 1
#2: 2 0 2 2 1 2 1
#3: 3 2 1 1 0 1 0
We can use base R (R >= 4.1.0, for the native pipe and lambda syntax). Collect the data frames into a list ('lst1') and find the union of their column names ('nm1'). Then loop over the list, using setdiff to add any missing columns to each element filled with 0, rbind the elements, and use aggregate to get the sum grouped by 'No'
lst1 <- mget(ls(pattern= '^df\\d+$'))
nm1 <- lapply(lst1, names) |>
{\(x) Reduce(union, x)}()
lapply(lst1, \(x) {x[setdiff(nm1, names(x))] <- 0; x}) |>
{\(x) do.call(rbind, x)}() |>
{\(dat) aggregate(.~ No, data = dat, FUN = sum, na.rm = TRUE,
na.action = na.pass)}()
# No A B C D E F
#1 1 2 0 1 0 1 1
#2 2 0 2 2 1 2 1
#3 3 2 1 1 0 1 0
I'd like to find consecutive months by client. I thought this would be easy, but I still can't find a solution.
My goal is to find months' consecutive purchases for each client. Any
My data
Client Month consecutive
A 1 1
A 1 2
A 2 3
A 5 1
A 6 2
A 8 1
B 8 1
In base R, we can use ave
df$consecutive <- with(df, ave(Month, Client, cumsum(c(TRUE, diff(Month) > 1)),
FUN = seq_along))
df
# Client Month consecutive
#1 A 1 1
#2 A 1 2
#3 A 2 3
#4 A 5 1
#5 A 6 2
#6 A 8 1
#7 B 8 1
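The grouping expression is the core of this answer, so here is a step-by-step sketch on client A's months alone. The intermediate names gaps and run_id are hypothetical, introduced only for illustration.

```r
# Months for client A from the question
Month <- c(1L, 1L, 2L, 5L, 6L, 8L)

# TRUE wherever the jump from the previous month is more than 1,
# i.e. where a new run of consecutive months starts
gaps <- diff(Month) > 1

# Prepending TRUE starts the first run; cumsum() then numbers the runs
run_id <- cumsum(c(TRUE, gaps))

# Within each run, count rows 1, 2, 3, ...
consecutive <- ave(Month, run_id, FUN = seq_along)

data.frame(Month, run_id, consecutive)
```

In the full answer, Client is passed to ave as an additional grouping variable, so a run never continues across two clients even when their months happen to be adjacent.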
In dplyr, we can create a new group with lag to compare the current month with the previous month and assign row_number() in each group.
library(dplyr)
df %>%
group_by(Client,group=cumsum(Month-lag(Month, default = first(Month)) > 1)) %>%
mutate(consecutive = row_number()) %>%
ungroup %>%
select(-group)
We can create a grouping variable based on the difference in adjacent 'Month' for each 'Client' and use that to create the sequence
library(dplyr)
df1 %>%
group_by(Client) %>%
group_by(grp = cumsum(c(TRUE, diff(Month) > 1)), .add = TRUE) %>%
mutate(consec = row_number()) %>%
ungroup %>%
select(-grp)
# A tibble: 7 x 4
# Client Month consecutive consec
# <chr> <int> <int> <int>
#1 A 1 1 1
#2 A 1 2 2
#3 A 2 3 3
#4 A 5 1 1
#5 A 6 2 2
#6 A 8 1 1
#7 B 8 1 1
Or using data.table
library(data.table)
setDT(df1)[, grp := cumsum(c(TRUE, diff(Month) > 1)), Client
][, consec := seq_len(.N), .(Client, grp)
][, grp := NULL][]
data
df1 <- structure(list(Client = c("A", "A", "A", "A", "A", "A", "B"),
Month = c(1L, 1L, 2L, 5L, 6L, 8L, 8L), consecutive = c(1L,
2L, 3L, 1L, 2L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-7L))
I am trying to iterate through columns: if a column is a whole year, it should be duplicated four times and renamed to quarters.
So this
2000 Q1-01 Q2-01 Q3-01
1 2 3 3
Should become this:
Q1-00 Q2-00 Q3-00 Q4-00 Q1-01 Q2-01 Q3-01
1 1 1 1 2 3 3
Any ideas?
We can use stringr::str_detect to find column names with 4 digits, then take the last two digits of those names
library(dplyr)
library(tidyr)
library(stringr)
df %>% gather(key,value) %>% group_by(key) %>%
mutate(key_new = ifelse(str_detect(key,'\\d{4}'),paste0('Q',1:4,'-',str_extract(key,'\\d{2}$'),collapse = ','),key)) %>%
ungroup() %>% select(-key) %>%
separate_rows(key_new,sep = ',') %>% spread(key_new,value)
PS: I hope you don't have a large dataset
Since you want repeated columns, you can just re-index your data frame and then update the column names
df <- structure(list(`2000` = 1L, Q1.01 = 2L, Q2.01 = 3L, Q3.01 = 3L,
`2002` = 1L, Q1.03 = 2L, Q2.03 = 3L, Q3.03 = 3L), row.names = c(NA,
-1L), class = "data.frame")
#> df
#2000 Q1.01 Q2.01 Q3.01 2002 Q1.03 Q2.03 Q3.03
#1 1 2 3 3 1 2 3 3
# Get indices of columns that consist of 4 numbers
col.ids <- grep('^[0-9]{4}$', names(df))
# For each of those, create new names, and for the rest preserve the old names
new.names <- lapply(seq_along(df), function(i) {
if (i %in% col.ids)
return(paste(substr(names(df)[i], 3, 4), c('Q1', 'Q2', 'Q3', 'Q4'), sep = '.'))
return(names(df)[i])
})
# Now repeat each of those columns 4 times
df <- df[rep(seq_along(df), ifelse(seq_along(df) %in% col.ids, 4, 1))]
# ...and finally set the column names to the desired new names
names(df) <- unlist(new.names)
#> df
#00.Q1 00.Q2 00.Q3 00.Q4 Q1.01 Q2.01 Q3.01 02.Q1 02.Q2 02.Q3 02.Q4 Q1.03 Q2.03 Q3.03
#1 1 1 1 1 2 3 3 1 1 1 1 2 3 3
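The same re-indexing idea can produce the Q1-00 style names from the question. This is a sketch under assumptions: the column names use hyphens as in the question (hence check.names = FALSE), and is_year and new_names are hypothetical names introduced here.

```r
# Data shaped like the question's example; check.names = FALSE keeps
# the non-syntactic names "2000" and "Q1-01" as they are
df <- data.frame(`2000` = 1L, `Q1-01` = 2L, `Q2-01` = 3L, `Q3-01` = 3L,
                 check.names = FALSE)

# TRUE for columns whose name is exactly four digits (a whole year)
is_year <- grepl("^[0-9]{4}$", names(df))

# Repeat year columns 4 times and all other columns once
out <- df[rep(seq_along(df), ifelse(is_year, 4, 1))]

# Year columns become Q1-yy ... Q4-yy; other names are kept unchanged
new_names <- Map(function(nm, yr) {
  if (yr) paste0("Q", 1:4, "-", substr(nm, 3, 4)) else nm
}, names(df), is_year)
names(out) <- unlist(new_names, use.names = FALSE)
out
```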
I want to change the data below
pos BZ_SP BZ_SP_m1 BZ_SP_m2 CL_SP CL_SP_m1 CL_SP_m2
1 -300000 2 3 2540544 1 2
2 0 0 0 -118621 3 4
to look like this:
CurveGroup SpreadId SpreadMonth1 SpreadMonth2 Position
BZ_SP 1 2 3 -300000
CL_SP 1 1 2 2540544
BZ_SP 2 0 0 0
CL_SP 2 3 4 -118621
Gather the input into long form and then separate the variable into CurveGroup and suffix. Spread it back out to wide form, then rename and rearrange the columns.
library(dplyr)
library(tidyr)
DF %>%
gather(variable, value, -pos) %>%
separate(variable, c("CurveGroup", "suffix"), sep = 5, fill = "right") %>%
spread(suffix, value) %>%
select(CurveGroup, SpreadId = "pos", SpreadMonth1 = "_m1", SpreadMonth2 = "_m2",
Position = "V1")
giving:
CurveGroup SpreadId SpreadMonth1 SpreadMonth2 Position
1 BZ_SP 1 2 3 -300000
2 CL_SP 1 1 2 2540544
3 BZ_SP 2 0 0 0
4 CL_SP 2 3 4 -118621
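For comparison, the same reshaping can be sketched in base R with no packages. This is an alternative under assumptions, not the answer's method: it assumes every curve group has exactly the columns <group>, <group>_m1 and <group>_m2, and the names groups and long are hypothetical.

```r
# Input from the answer's note
DF <- data.frame(pos = 1:2,
                 BZ_SP = c(-300000L, 0L), BZ_SP_m1 = c(2L, 0L), BZ_SP_m2 = c(3L, 0L),
                 CL_SP = c(2540544L, -118621L), CL_SP_m1 = c(1L, 3L), CL_SP_m2 = c(2L, 4L))

# Curve groups are the value columns that do not end in _m<digits>
groups <- grep("_m[0-9]+$", names(DF)[-1], invert = TRUE, value = TRUE)

# Build one block of rows per curve group, then stack the blocks
long <- do.call(rbind, lapply(groups, function(g) {
  data.frame(CurveGroup = g,
             SpreadId = DF$pos,
             SpreadMonth1 = DF[[paste0(g, "_m1")]],
             SpreadMonth2 = DF[[paste0(g, "_m2")]],
             Position = DF[[g]])
}))

# Order by spread id, then curve group, to match the desired layout
long <- long[order(long$SpreadId, long$CurveGroup), ]
rownames(long) <- NULL
long
```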
Note: The input DF in reproducible form is:
DF <- structure(list(pos = 1:2, BZ_SP = c(-300000L, 0L), BZ_SP_m1 = c(2L,
0L), BZ_SP_m2 = c(3L, 0L), CL_SP = c(2540544L, -118621L), CL_SP_m1 = c(1L,
3L), CL_SP_m2 = c(2L, 4L)), .Names = c("pos", "BZ_SP", "BZ_SP_m1",
"BZ_SP_m2", "CL_SP", "CL_SP_m1", "CL_SP_m2"),
class = "data.frame", row.names = c(NA, -2L))
Update: Simplified.