create multiple columns at once, depending on the number of columns of another df, with dplyr - r

I want to create with dplyr a df with n columns (depending on the number of columns of df data), where the root of the name is the same TIME. and first the column is equal to 1 in all rows, the second equal to 2 and so on. The number of rows is the same as data
data <- data.frame(ID=c(1:6), VALUE.1=c(2,5,7,1,3,5), VALUE.2=c(1,7,2,4,5,4), VALUE.3=c(9,2,6,3,4,4), VALUE.4=c(1,2,3,2,3,8))
And the first column of data as first column. This is what I'd like to have:
ID TIME.1 TIME.2 TIME.3 TIME.4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 1 2 3 4
5 1 2 3 4
6 1 2 3 4
Now I'm doing:
T1 <- data.frame(ID=unique(data$ID), TIME.1=rep(1, length(unique(data$ID))), TIME.2=rep(2, length(unique(data$ID))), TIME.3=rep(3, length(unique(data$ID))), TIME.4=rep(4, length(unique(data$ID))) )

We can replace the column contents with the suffix in the column name, then rename the columns from VALUE.n to TIME.n.
library(dplyr)
data %>%
mutate(across(starts_with("VALUE"), ~sub("VALUE.", "", cur_column()))) %>%
rename_with(~sub("VALUE", "TIME", .x))
ID TIME.1 TIME.2 TIME.3 TIME.4
1 1 1 2 3 4
2 2 1 2 3 4
3 3 1 2 3 4
4 4 1 2 3 4
5 5 1 2 3 4
6 6 1 2 3 4

Here is a base R approach that may give a similar result. This involves creating a matrix based on your other data.frame data, using its dimensions for column names and determining the number of rows. We subtract 1 from number of columns given the first ID column present.
nc <- ncol(data) - 1
nr <- nrow(data)
as.data.frame(cbind(
ID = data$ID,
matrix(1:nc, ncol = nc, nrow = nr, byrow = T, dimnames = list(NULL, paste0("TIME.", 1:nc)))
))
Output
ID TIME.1 TIME.2 TIME.3 TIME.4
1 1 1 2 3 4
2 2 1 2 3 4
3 3 1 2 3 4
4 4 1 2 3 4
5 5 1 2 3 4
6 6 1 2 3 4

Related

Using loops with mutate in R to sum columns with partially matching column names

df <- data.frame(x_1_jr=c(1,2,3,4), x_2_jr=c(1,2,3,4), y_1_jr=c(4,3,2,1), y_2_jr=c(4,3,2,1)
x_1_jr x_2_jr y_1_jr y_2_jr
1 1 1 4 4
2 2 2 3 3
3 3 3 2 2
4 4 4 1 1
I am trying to generate new variables that are the sum of x and y with the same column name suffix, i.e.
df <- df %>% mutate(z_1_jr= x_1_jr + y_1_jr)
x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr
1 1 1 4 4 5
2 2 2 3 3 5
3 3 3 2 2 5
4 4 4 1 1 5
I could write this out for each variable combination, but I have a large number of variables(>50 for each x and y group), and would like to use a loop... however, I'm relatively new to R and am not sure where to begin!
Can someone help? Thank you!
EDIT: for additional clarity, the dataset contains other non-numeric variables. There are >700 columns (from a large survey). x_1_jr represents, for example, the number of male individuals ages 1 year, y_1_jr female individuals of 1 year. I am trying to get a total (male plus female of 1 year) for each age group.
A
An option with base R
df[c("z_1_jr", "z_2_jr")] <- sapply(split.default(df,
sub("^[a-z]+_", "", names(df))), rowSums)
df
# x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr z_2_jr
#1 1 1 4 4 5 5
#2 2 2 3 3 5 5
#3 3 3 2 2 5 5
#4 4 4 1 1 5 5
One dplyr and purrr option could be:
df %>%
bind_cols(map_dfc(.x = unique(sub(".*?_", "_", names(df))),
~ df %>%
transmute(!!paste0("z", .x) := rowSums(select(., ends_with(.x))))))
x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr z_2_jr
1 1 1 4 4 5 5
2 2 2 3 3 5 5
3 3 3 2 2 5 5
4 4 4 1 1 5 5

Subseting data frame based on multiple criteria for deletion of rows

Consider the following data frame consisting of column names "id" and "x", where each id is repeated four times. Data is as follows:
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
The question is about how to subset the data frame by the following criteria:
(1) keep all entries of each id, if its corresponding values in column x does not contain 3 or it has 3 as the last number.
(2) for a given id with multiple 3s in column x, keep all the numbers up to the first 3 and delete the remaining 3s. The expected output would look like:
id x
1 1 2
2 1 2
3 1 1
4 1 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 2
10 3 3
11 4 2
12 4 2
13 4 3
I am familiar with the use of the 'filter' function in dplyr package to subset data, but this particular situation confuses me because of the complexity of the above criteria. Any help on this would be greatly appraciated.
Here's one solution that uses / creates some new columns to help you filter on:
library(dplyr)
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
df %>%
group_by(id) %>% # for each id
mutate(num_threes = sum(x == 3), # count number of 3s
flag = ifelse(unique(num_threes) > 0, # if there is a 3
min(row_number()[x == 3]), # keep the row of the first 3
0)) %>% # otherwise put a 0
filter(num_threes == 0 | row_number() <= flag) %>% # keep ids with no 3s or up to first 3
ungroup() %>%
select(-num_threes, -flag) # remove helpful columns
# # A tibble: 13 x 2
# id x
# <dbl> <dbl>
# 1 1 2
# 2 1 2
# 3 1 1
# 4 1 1
# 5 2 2
# 6 2 3
# 7 3 1
# 8 3 2
# 9 3 2
# 10 3 3
# 11 4 2
# 12 4 2
# 13 4 3
this works for me:
data
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
commands
library(dplyr)
df <- mutate(df, before = lag(x))
df$condition1 <- 1
df$condition1[df$x == 3 & df$before == 3] <- 0
final_df <- df[df$condition1 == 1, 1:2]
result
x id
1 2
1 2
1 1
1 1
2 2
2 3
3 1
3 2
3 2
3 3
4 2
4 2
4 3`
One idea is to pick out the rows with x==3 and use unique() over them. Then append the unique rows with just single 3 to the rest part of the data frame, and finally order the rows.
Here is a solution with base R for the idea above:
res <- (r <- with(df,rbind(df[x!=3,],unique(df[x==3,]))))[order(as.numeric(rownames(r))),]
rownames(res) <- seq(nrow(res))
which give
> res
id x
1 1 2
2 1 2
3 1 1
4 1 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 2
10 3 3
11 4 2
12 4 2
13 4 3
DATA
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))

Get columns in frame based on values in second frame

I have 2 dataframes. One has a ID column with alot of arranged IDs.
The other one has just specific rows of the first column. Those are my markers.
I need to get the sum of the of the values in a specific column based on the id values of the second column.
The first column may be
id goals cards group
1 2 2 1
2 3 2 1
3 4 2 1
4 5 1 1
5 1 2 1
1 2 2 2
2 3 2 2
3 4 2 2
4 5 1 3
5 1 2 3
the second one:
id goals cards group
2 3 2 1
5 1 2 1
2 3 2 2
3 4 2 2
5 1 2 3
what i need to get:
id goals cards group points
1 2 2 1 2-(2+2)
2 3 2 1 0 cause in second list
3 4 2 1 4-(2+1+2)
4 5 1 1 5-(1+2)
5 1 2 1 0 cause in second list
1 2 2 2 2-(2+2)
2 3 2 2 0
3 4 2 2 0
4 5 1 3 5-(1+2)
5 1 2 3 0
Something like: ??
df1<- df1%>%
rowwise() %>%
mutate(points=
goals
-(sum( df1$cards[df1$id <= df2$id & df1$id>df1$id])))
df1 = read.table(text = "
id goals cards
1 2 2
2 3 2
3 4 2
4 5 1
5 1 2
", header=T)
df2 = read.table(text = "
id goals cards
2 3 2
5 1 2
", header=T)
library(dplyr)
# function that gets an id and returns the sum of cards based on df2
GetSumOfCards = function(x) {
ids = min(df2$id[df2$id >= x]) # for a given id of df1 find the minimum id in df2 that is bigger than this id
ifelse(x %in% df2$id, # if the given id exists in df2
0, # sum of cards is zero
sum(df1$cards[df1$id >= x & df1$id <= ids])) # otherwise get sum of cards in df1 from this id until the id obtained before
}
# update function to be vectorised
GetSumOfCards = Vectorize(GetSumOfCards)
df1 %>%
mutate(sum_cards = GetSumOfCards(id), # get sum of cards for each id using the function
points = goals - sum_cards) # get the points
# id goals cards sum_cards points
# 1 1 2 2 4 -2
# 2 2 3 2 0 3
# 3 3 4 2 5 -1
# 4 4 5 1 3 2
# 5 5 1 2 0 1
Based on your updated question, applying a similar function to every row makes the process very slow. So, this solution groups data in a way that you can just count the cards on chunks of data/rows:
df1 = read.table(text = "
id goals cards group
1 2 2 1
2 3 2 1
3 4 2 1
4 5 1 1
5 1 2 1
1 2 2 2
2 3 2 2
3 4 2 2
4 5 1 3
5 1 2 3
", header=T)
df2 = read.table(text = "
id goals cards group
2 3 2 1
5 1 2 1
2 3 2 2
3 4 2 2
5 1 2 3
", header=T)
library(dplyr)
df1 %>%
arrange(group, desc(id)) %>% # order by group and id descending (this will help with counting the cards)
left_join(df2 %>% # join specific columns of df2 and add a flag to know that this row exists in df2
select(id, group) %>%
mutate(flag = 1), by=c("id","group")) %>%
mutate(flag = ifelse(is.na(flag), 0, flag), # replace NA with 0
flag2 = cumsum(flag)) %>% # this flag will create the groups we need to count cards
group_by(group, flag2) %>% # for each new group (we need both as the card counting will change when we have a row from df2, or if group changes)
mutate(sum_cards = ifelse(flag == 1, 0, cumsum(cards))) %>% # get cummulative sum of cards unless the flag = 1, where we need 0 cards
ungroup() %>% # forget the grouping
arrange(group, id) %>% # back to original order
mutate(points = goals - sum_cards) %>% # calculate points
select(-flag, -flag2) # remove flags
# # A tibble: 10 x 6
# id goals cards group sum_cards points
# <int> <int> <int> <int> <dbl> <dbl>
# 1 1 2 2 1 4 -2
# 2 2 3 2 1 0 3
# 3 3 4 2 1 5 -1
# 4 4 5 1 1 3 2
# 5 5 1 2 1 0 1
# 6 1 2 2 2 4 -2
# 7 2 3 2 2 0 3
# 8 3 4 2 2 0 4
# 9 4 5 1 3 3 2
# 10 5 1 2 3 0 1

R - Loop through a data table with combination of dcast of sum

I have a table similar this, with more columns. What I am trying to do is creating a new table that shows, for each ID, the number of Counts of each Type, the Value of each Type.
df
ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3
I am able to do it for one single column by using
dcast(df[,j=list(sum(Counts,na.rm = TRUE)),by = c("ID","Type")],ID ~ paste(Type,"Counts",sep="_"))
However, I want to use a loop through each column within the data table. but there is no success, it will always add up all the rows. I have try to use
sum(df[[i]],na.rm = TRUE)
sum(names(df)[[i]] == "",na.rm = TRUE)
sum(df[[names(df)[i]]],na.rm = TRUE)
j = list(apply(df[,c(3:4),with=FALSE],2,function(x) sum(x,na.rm = TRUE)
I want to have a new table similar like
ID A_Counts B_Counts A_Value B_Value
1 1 2 5 4
2 5 3 5 6
My own table have more columns, but the idea is the same. Do I over-complicated it or is there a easy trick I am not aware of? Please help me. Thank you!
You have to melt your data first, and then dcast it:
library(reshape2)
df2 <- melt(df,id.vars = c("ID","Type"))
# ID Type variable value
# 1 1 A Counts 1
# 2 1 B Counts 2
# 3 2 A Counts 2
# 4 2 A Counts 3
# 5 2 B Counts 1
# 6 2 B Counts 2
# 7 1 A Value 5
# 8 1 B Value 4
# 9 2 A Value 1
# 10 2 A Value 4
# 11 2 B Value 3
# 12 2 B Value 3
dcast(df2,ID ~ Type + variable,fun.aggregate=sum)
# ID A_Counts A_Value B_Counts B_Value
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Another solution with base functions only:
df3 <- aggregate(cbind(Counts,Value) ~ ID + Type,df,sum)
# ID Type Counts Value
# 1 1 A 1 5
# 2 2 A 5 5
# 3 1 B 2 4
# 4 2 B 3 6
reshape(df3, idvar='ID', timevar='Type',direction="wide")
# ID Counts.A Value.A Counts.B Value.B
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Data
df <- read.table(text ="ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3",stringsAsFactors=FALSE,header=TRUE)

Select rows of data frame based on a vector with duplicated values

What I want can be described as: give a data frame, contains all the case-control pairs. In the following example, y is the id for the case-control pair. There are 3 pairs in my data set. I'm doing a resampling with respect to the different values of y (the pair will be both selected or neither).
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
> sample_df
x y
1 1 1
2 2 1
3 3 2
4 4 2
5 5 3
6 6 3
select_y = c(1,3,3)
select_y
> select_y
[1] 1 3 3
Now, I have computed a vector contains the pairs I want to resample, which is select_y above. It means the case-control pair number 1 will be in my new sample, and number 3 will also be in my new sample, but it will occur 2 times since there are two 3. The desired output will be:
x y
1 1
2 1
5 3
6 3
5 3
6 3
I can't find out an efficient way other than writing a for loop...
Solution:
Based on #HubertL , with some modifications, a 'vectorized' approach looks like:
sel_y <- as.data.frame(table(select_y))
> sel_y
select_y Freq
1 1 1
2 3 2
sub_sample_df = sample_df[sample_df$y%in%select_y,]
> sub_sample_df
x y
1 1 1
2 2 1
5 5 3
6 6 3
match_freq = sel_y[match(sub_sample_df$y, sel_y$select_y),]
> match_freq
select_y Freq
1 1 1
1.1 1 1
2 3 2
2.1 3 2
sub_sample_df$Freq = match_freq$Freq
rownames(sub_sample_df) = NULL
sub_sample_df
> sub_sample_df
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
4 6 3 2
selected_rows = rep(1:nrow(sub_sample_df), sub_sample_df$Freq)
> selected_rows
[1] 1 2 3 3 4 4
sub_sample_df[selected_rows,]
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
3.1 5 3 2
4 6 3 2
4.1 6 3 2
Another method of doing the same without a loop:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
row_names <- split(1:nrow(sample_df),sample_df$y)
select_y = c(1,3,3)
row_num <- unlist(row_names[as.character(select_y)])
ans <- sample_df[row_num,]
I can't find a way without a loop, but at least it's not a for loop, and there is only one iteration per frequency:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
select_y = c(1,3,3)
sel_y <- as.data.frame(table(select_y))
do.call(rbind,
lapply(1:max(sel_y$Freq),
function(freq) sample_df[sample_df$y %in%
sel_y[sel_y$Freq>=freq, "select_y"],]))
x y
1 1 1
2 2 1
5 5 3
6 6 3
51 5 3
61 6 3

Resources