Add missing index in a dataframe - r

Hi I have a messy data frame as follows:
df <- data.frame(age.band = c("0-5","5-10"), beg.code = c("A1","B1"), end.code=c("A5","B3"),value = c(10,5))
age.band beg.code end.code value
0-5 A1 A5 10
5-10 B1 B3 5
I would like to transform it into a friendlier format such as:
index age.band value
A1 0-5 10
A2 0-5 10
A3 0-5 10
A4 0-5 10
A5 0-5 10
B1 5-10 5
B2 5-10 5
B3 5-10 5
Can anyone help me to find a way to add all the missing indexes for this dataframe? Thanks

A solution using dplyr and tidyr. Notice that I added stringsAsFactors = FALSE to avoid creating factor columns when creating your example data frame. If you run the code on your original data frame, you will receive a warning message due to the factor columns, but it will not affect the end results.
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Code, Value, ends_with("code")) %>%
extract(Value, into = c("Group", "Index"), regex = "^([A-Za-z]+)(\\d+)$",
convert = TRUE) %>%
select(-Code) %>%
group_by(Group) %>%
complete(Index = full_seq(Index, period = 1)) %>%
unite(Index, c("Group", "Index"), sep = "") %>%
fill(-Index)
df2
# # A tibble: 8 x 3
# Index age.band value
# * <chr> <chr> <dbl>
# 1 A1 0-5 10
# 2 A2 0-5 10
# 3 A3 0-5 10
# 4 A4 0-5 10
# 5 A5 0-5 10
# 6 B1 5-10 5
# 7 B2 5-10 5
# 8 B3 5-10 5
DATA
df <- data.frame(age.band = c("0-5","5-10"), beg.code = c("A1","B1"), end.code=c("A5","B3"),value = c(10,5),
stringsAsFactors = FALSE)

Here is one option with base R. The idea is to remove the non-numeric characters from the 'code' columns, convert the result to numeric, and get the sequences stored as a list. Then paste the non-numeric characters back on and finally, based on the lengths of the list elements, expand the rows of the original dataset with rep and create a new column 'index' by unlisting the list.
lst <- do.call(Map, c(f = `:`, lapply(df[2:3], function(x) as.numeric(sub("\\D+", "", x)))))
lst1 <- Map(paste0, substr(df[,2], 1, 1), lst)
data.frame(index = unlist(lst1), df[rep(seq_len(nrow(df)), lengths(lst1)), -(2:3)])
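The same idea can be spelled out step by step. This is only a sketch under the assumption that every code is a letter prefix followed by a number and that beg.code and end.code share the prefix; the names prefix, first, last, n_each and expanded are made up for illustration.

```r
df <- data.frame(age.band = c("0-5", "5-10"),
                 beg.code = c("A1", "B1"),
                 end.code = c("A5", "B3"),
                 value = c(10, 5),
                 stringsAsFactors = FALSE)

# Split each code into its letter prefix and its trailing number
prefix <- sub("\\d+$", "", df$beg.code)
first  <- as.integer(sub("^\\D+", "", df$beg.code))
last   <- as.integer(sub("^\\D+", "", df$end.code))

# How many rows each original row expands to
n_each <- last - first + 1L

# Repeat every original row n_each times and rebuild the full index
expanded <- data.frame(
  index    = paste0(rep(prefix, n_each), unlist(Map(seq, first, last))),
  age.band = rep(df$age.band, n_each),
  value    = rep(df$value, n_each),
  stringsAsFactors = FALSE
)
expanded
```

Because the prefix is taken with a regular expression rather than substr, this variant should also cope with prefixes longer than one letter and numbers with more than one digit.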


R - applying calculation pairwise on columns of data frame/data table

Let's say I have the data frames with the same column names
DF1 = data.frame(a = c(0,1), b = c(2,3), c = c(4,5))
DF2 = data.frame(a = c(6,7), c = c(8,9))
and want to apply some basic calculation on them, for example add each column.
Since I also want the goal data frame to display missing data, I appended such a column to DF2, so I have
> DF2
a c b
1 6 8 NA
2 7 9 NA
What I tried here now is to create the data frame
for(i in names(DF2)){
DF3 = data.frame(i = DF1[i] + DF2[i])
}
(and then bind this together) but this obviously doesn't work since the order of the columns is mashed up.
So, what's the best way to do this pairwise calculation when the order of the columns is not the same, without reordering them?
I also tried doing (since this is what I thought would be a fix)
for(i in names(DF2)){
DF3 = data.frame(i = DF1$i + DF2$i)
}
but this doesn't work because DF1$i is NULL for all i.
Conclusion: I want the data frame
>DF3
a b c
1 6+0 NA 4+8
2 1+7 NA 5+9
Any help would be appreciated.
This may help -
#Get column names from DF1 and DF2
all_cols <- union(names(DF1), names(DF2))
#Fill missing columns with NA in both data frames
DF1[setdiff(all_cols, names(DF1))] <- NA
DF2[setdiff(all_cols, names(DF2))] <- NA
#add the two dataframes arranging the columns
DF1[all_cols] + DF2[all_cols]
# a b c
#1 6 NA 12
#2 8 NA 14
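As a side note on the loop in the question: DF1$i looks for a column literally named "i", which is why it returns NULL, whereas DF1[[i]] uses the value stored in i. A minimal sketch of a name-by-name version, where col_or_na is a hypothetical helper that is not part of any package:

```r
DF1 <- data.frame(a = c(0, 1), b = c(2, 3), c = c(4, 5))
DF2 <- data.frame(a = c(6, 7), c = c(8, 9))

i <- "a"
is.null(DF1$i)   # TRUE: `$` does not evaluate i
DF1[[i]]         # the column named "a"

# Hypothetical helper: the column if it exists, otherwise a column of NA
col_or_na <- function(data, name) {
  if (name %in% names(data)) data[[name]] else rep(NA_real_, nrow(data))
}

all_cols <- sort(union(names(DF1), names(DF2)))

# Add the columns pairwise by name, so column order no longer matters
DF3 <- as.data.frame(
  lapply(setNames(all_cols, all_cols),
         function(nm) col_or_na(DF1, nm) + col_or_na(DF2, nm))
)
DF3
```

Unlike the approach above, this leaves DF1 and DF2 unmodified, at the cost of a small helper function.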
We can use bind_rows
library(dplyr)
library(data.table)
bind_rows(DF1, DF2, .id = 'grp') %>%
group_by(grp = rowid(grp)) %>%
summarise(across(everything(), sum), .groups = 'drop') %>%
select(-grp)
Output:
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 6 NA 12
2 8 NA 14
Another base R option using aggregate + stack + reshape
aggregate(
. ~ rid,
transform(
reshape(
transform(rbind(
stack(DF1),
stack(DF2)
),
rid = ave(seq_along(ind), ind, FUN = seq_along)
),
direction = "wide",
idvar = "rid",
timevar = "ind"
),
rid = 1:nrow(DF1)
),
sum,
na.action = "na.pass"
)[-1]
gives
values.a values.b values.c
1 6 NA 12
2 8 NA 14

How to use map_if based on the name of each element in the list

suppose I have a list of dataframes as follows:
df1 <- data.frame(a1 = 1:5, a2 = 1:5, a3 = 1:5)
df2 <- data.frame(a1 = 1:3, a2 = 2:4, a3 = 3:5)
df3 <- data.frame(a1 = 10:20, a2 = 5:15)
l <- list(df1 = df1, df2 = df2, df3 = df3)
What should I do to perform operations (like mutate) on each element of the list, conditional on the element's name?
For instance, how would I proceed if I wanted to add a new column only when dealing with df1 or df3, and delete a column when dealing with df2?
Could map_if deal with that?
PS: Keep in mind that the list would probably have more than 3 datasets so that possibly multiple conditions would be needed.
You can do this sort of operation with imap instead. Since you want to act on the names of the elements of your list, imap is the right tool.
The .f argument in imap takes 2 arguments:
.x, the first argument, which represents the value
.y, the second argument, which represents the names of the elements or, if they don't have names, their positions
So in this case the .x values are your 3 data sets and the .y values are their names (df1, df2, df3), or their positions 1:3 if the list were unnamed.
library(purrr)
library(dplyr)
l %>%
imap(~ if(.y %in% c("df1", "df3")) {
.x %>%
mutate(a3 = a1 + a2)
} else {
.x <- .x[-3]
.x
})
$df1
a1 a2 a3
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
$df2
a1 a2
1 1 2
2 2 3
3 3 4
$df3
a1 a2 a3
1 10 5 15
2 11 6 17
3 12 7 19
4 13 8 21
5 14 9 23
6 15 10 25
7 16 11 27
8 17 12 29
9 18 13 31
10 19 14 33
11 20 15 35
But if you would like to apply a function to each element that meets a certain condition, then you can use map_if. For example, suppose we would like to add an a4 column if the number of rows in an element is greater than a certain number. Bear in mind that the .p argument should return a single TRUE or FALSE:
# This use case works
l %>%
map_if(~ nrow(.x) > 3, ~ .x %>%
mutate(a4 = a1 + a2))
# But this doesn't, because names(.x) are actually the column names of each element, so the condition is not the one you are after
l %>%
map_if(~ names(.x) %in% c("df1", "df3"), ~ .x %>%
mutate(a4 = a1 + a2))
An equivalent to imap is map2 where the second argument is the names of each element (and not the column names of each element):
l %>%
map2(names(l), ~ if(.y %in% c("df1", "df3")) {
.x %>%
mutate(a3 = a1 + a2)
} else {
.x <- .x[-3]
.x
})
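With many datasets, the if/else can grow unwieldy. One possible alternative, sketched here in base R, is a lookup table of functions keyed by element name. The names handlers, add_sum and drop_a3 are hypothetical and only serve the illustration; elements without an entry in the table are returned unchanged.

```r
df1 <- data.frame(a1 = 1:5, a2 = 1:5, a3 = 1:5)
df2 <- data.frame(a1 = 1:3, a2 = 2:4, a3 = 3:5)
df3 <- data.frame(a1 = 10:20, a2 = 5:15)
l <- list(df1 = df1, df2 = df2, df3 = df3)

# Hypothetical operations, one per kind of treatment
add_sum <- function(d) transform(d, a3 = a1 + a2)
drop_a3 <- function(d) d[setdiff(names(d), "a3")]

# Lookup table: which operation applies to which list element
handlers <- list(df1 = add_sum, df2 = drop_a3, df3 = add_sum)

# Walk over the values and their names together, as imap/map2 do
result <- Map(function(d, nm) {
  f <- handlers[[nm]]
  if (is.null(f)) d else f(d)   # no handler: leave the element as it is
}, l, names(l))

result$df2
```

Adding a new dataset then only requires a new entry in handlers rather than another branch in the condition.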

Conditional count in data frame

I have a dataframe (df) with three columns like so:
Structure:
id id1 age
A1 a1 32
A1 a2 45
A1 a3 45
A1 a4 12
A2 b1 15
A2 b5 34
A2 b64 17
Expected Output:
id count count1
A1 4 1
A2 3 2
Logic:
Column "count" is the number of times "id" is repeated
Column "count1" is the number of rows where age is less than 21
Current Code:
library(dplyr)
df_summarized <- df %>%
group_by(id) %>%
summarise(count = n(),count1 = count(age<21))
Problem:
Error: no applicable method for 'group_by_' applied to an object of class "logical"
We need to do the sum
df %>%
group_by(id) %>%
summarise(count = n(),count1 = sum(age < 21))
# A tibble: 2 × 3
# id count count1
# <chr> <int> <int>
#1 A1 4 1
#2 A2 3 2
because count() operates on a data.frame or tbl_df, not on a single column inside summarise().
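The reason sum works here is that a comparison such as age < 21 yields a logical vector, and TRUE is treated as 1 when summed. A small base R sketch of that reasoning, assuming the data from the question (the names under_21 and counts are made up for illustration):

```r
df <- data.frame(
  id  = c("A1", "A1", "A1", "A1", "A2", "A2", "A2"),
  id1 = c("a1", "a2", "a3", "a4", "b1", "b5", "b64"),
  age = c(32, 45, 45, 12, 15, 34, 17)
)

# A logical vector: TRUE where the condition holds
under_21 <- df$age < 21

# Summing a logical vector counts its TRUE values
sum(under_21)

# The same count per id, next to the group sizes
counts <- data.frame(
  id     = sort(unique(df$id)),
  count  = as.vector(table(df$id)),
  count1 = as.vector(tapply(under_21, df$id, sum))
)
counts
```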
Or using data.table
library(data.table)
setDT(df)[, .(count = .N, count1 = sum(age < 21)), id]
Or with base R
cbind(count = rowSums(table(df[-2])), count1 = as.vector(rowsum(+(df$age < 21), df$id)))
# count count1
#A1 4 1
#A2 3 2
Or using aggregate based on the sum
do.call(data.frame, aggregate(age~id, df, FUN =
function(x) c(count = length(x), count1 = sum(x<21))))
NOTE: All the above methods give a dataset with proper columns. This matters especially for aggregate, whose output column is a matrix; that is why it is converted to proper columns with do.call(data.frame, ...).
With base R, we can use aggregate to find number of rows for each group (id) as well as number of rows with value less than 21
aggregate(age~id, df, function(x) c(count = length(x),
count1 = length(x[x < 21])))
# id age.count age.count1
#1 A1 4 1
#2 A2 3 2

forloop inside dplyr mutate

I would like to do a few column operations using mutate in a more elegant way, as I have more than 200 columns in my table that I would like to transform using mutate.
here is an example
Sample data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20),c2_y=rnorm(20),c3_z=rnorm(20),
c4_x=rnorm(20),c5_y=rnorm(20),c6_z=rnorm(20),
c7_x=rnorm(20),c8_y=rnorm(20),c9_z=rnorm(20),
c10_x=rnorm(20),c11_y=rnorm(20),c12_z=rnorm(20),
c_n=rnorm(20))
sample code:
dfm<-df %>%
mutate(cx=(c1_x*c4_x/c_n+c7_x*c10_x/c_n),
cy=(c2_y*c5_y/c_n+c8_y*c11_y/c_n),
cz=(c3_z*c6_z/c_n+c9_z*c12_z/c_n))
Despite the tangent, the initial recommendations for using tidyr functions is where you need to go. This pipe of functions seems to do the job based on what you've provided.
Your data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20), c2_y=rnorm(20), c3_z=rnorm(20),
c4_x=rnorm(20), c5_y=rnorm(20), c6_z=rnorm(20),
c7_x=rnorm(20), c8_y=rnorm(20), c9_z=rnorm(20),
c10_x=rnorm(20), c11_y=rnorm(20), c12_z=rnorm(20),
c_n=rnorm(20))
library(dplyr)
library(tidyr)
This first auxiliary data.frame is used to translate your c#_[xyz] variable into a unified one. I'm sure there are other ways to handle this, but it works and is relatively easy to reproduce and extend based on your 200+ columns.
variableTransform <- tibble(
cnum = paste0("c", 1:12),
cvar = rep(paste0("a", 1:4), each = 3)
)
head(variableTransform)
# Source: local data frame [6 x 2]
# cnum cvar
# <chr> <chr>
# 1 c1 a1
# 2 c2 a1
# 3 c3 a1
# 4 c4 a2
# 5 c5 a2
# 6 c6 a2
Here's the pipe all at once. I'll explain the steps in a sec. What you're looking for is likely a combination of the treatment, xyz, and ans columns.
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
tidyr::spread(cvar, value) %>%
mutate(
ans = a1 * (a2/c_n) + a3 * (a4/c_n)
) %>%
head
# treatment c_n xyz a1 a2 a3 a4 ans
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419 1.15801448
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979 -0.01828831
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878 -2.03197283
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493 0.15759418
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839 0.65270681
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095 0.06136036
First, we take the original data and turn all (except two) columns into two columns of "column name" and "column values" pairs:
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
# treatment c_n cnum value
# 1 a 0.20745647 c1_x -0.1250222
# 2 b 0.01015871 c1_x -0.4585088
# 3 a 1.65671028 c1_x -0.2455927
# 4 b -0.24037137 c1_x 0.6219516
# 5 a -1.16092349 c1_x -0.3716138
# 6 b 1.61191700 c1_x 1.7605452
It will be helpful to split c1_x into c1 and x in order to translate the first and preserve the latter:
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
# treatment c_n cnum xyz value
# 1 a 0.20745647 c1 x -0.1250222
# 2 b 0.01015871 c1 x -0.4585088
# 3 a 1.65671028 c1 x -0.2455927
# 4 b -0.24037137 c1 x 0.6219516
# 5 a -1.16092349 c1 x -0.3716138
# 6 b 1.61191700 c1 x 1.7605452
From here, let's translate the c1, c2, and c3 variables into a1 (repeat for other 9 variables) using variableTransform:
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
# treatment c_n xyz value cvar
# 1 a 0.20745647 x -0.1250222 a1
# 2 b 0.01015871 x -0.4585088 a1
# 3 a 1.65671028 x -0.2455927 a1
# 4 b -0.24037137 x 0.6219516 a1
# 5 a -1.16092349 x -0.3716138 a1
# 6 b 1.61191700 x 1.7605452 a1
Since we want to deal with multiple variables simultaneously (with a simple mutate), we need to bring some of the variables back into columns. (Gathering first and spreading now helps me keep things organized and well named. I'm confident somebody can come up with another way to do it.)
tidyr::spread(cvar, value) %>% head
# treatment c_n xyz a1 a2 a3 a4
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095
From here, we just need to mutate to get the right answer.
Similar to r2evans's answer, but with more manipulation instead of the joins (and less explanation).
library(tidyr)
library(stringr)
library(dplyr)
# get it into fully long form
gather(df, key = cc_xyz, value = value, c1_x:c12_z) %>%
# separate off the xyz and the c123
separate(col = cc_xyz, into = c("cc", "xyz")) %>%
# extract the number
mutate(num = as.numeric(str_replace(cc, pattern = "c", replacement = "")),
# mod it by 4 for groupings and add a letter so its a good col name
num_mod = paste0("v", (num %% 4) + 1)) %>%
# remove unwanted columns
select(-cc, -num) %>%
# go into a reasonable data width for calculation
spread(key = num_mod, value = value) %>%
# calculate
mutate(result = v1 + v2/c_n + v3 + v4 / c_n)
# treatment c_n xyz v1 v2 v3 v4 result
# 1 a -1.433858289 x 1.242153708 -0.985482158 -0.0240414692 1.98710285 0.51956295
# 2 a -1.433858289 y -0.019255516 0.074453615 -1.6081599298 1.18228939 -2.50389188
# 3 a -1.433858289 z -0.362785313 2.296744655 -0.0610463292 0.89797526 -2.65188998
# 4 a -0.911463819 x -1.088308527 -0.703388193 0.6308253909 0.22685013 0.06534405
# 5 a -0.911463819 y 1.284513516 1.410276163 0.5066869590 -2.07263912 2.51790289
# 6 a -0.911463819 z 0.957778345 -1.136532104 1.3959561507 -0.50021647 4.14947069
# ...
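For comparison, here is a base R sketch that stays in wide format and follows the formula from the question directly: for each suffix, the columns are ordered by their number, multiplied in consecutive pairs, summed, and divided by c_n. It assumes column names of the form c<number>_<suffix> and an even number of columns per suffix; pair_products and suffix_of are hypothetical names, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 20
suffix_of <- rep(c("x", "y", "z"), times = 4)

# Rebuild data shaped like the example: c1_x, c2_y, c3_z, ..., c12_z, c_n
df <- data.frame(treatment = rep(letters[1:2], n / 2))
for (i in 1:12) df[[paste0("c", i, "_", suffix_of[i])]] <- rnorm(n)
df$c_n <- rnorm(n)

# Hypothetical helper: (1st * 2nd + 3rd * 4th + ...) / c_n for one suffix
pair_products <- function(data, suffix) {
  cols <- grep(paste0("^c[0-9]+_", suffix, "$"), names(data), value = TRUE)
  cols <- cols[order(as.numeric(gsub("[^0-9]", "", cols)))]  # numeric order
  first  <- as.matrix(data[cols[c(TRUE, FALSE)]])   # 1st, 3rd, ... column
  second <- as.matrix(data[cols[c(FALSE, TRUE)]])   # 2nd, 4th, ... column
  unname(rowSums(first * second) / data$c_n)
}

# One new column per suffix: cx, cy, cz
for (s in c("x", "y", "z")) df[[paste0("c", s)]] <- pair_products(df, s)

head(df[c("treatment", "cx", "cy", "cz")])
```

With 200+ columns, only the vector of suffixes in the final loop would need to change, provided the naming pattern holds.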

Merging many CSVs into different data frames

I have many CSVs, each corresponding to a day's worth of data, stored like this:
Day1.csv:
ID, height, weight, color
a1, 3, 45, blue
a2, 3, 44, green
a3, 4, 48, blue
Day2.csv:
ID, height, weight, color
a1, 4, 47, green
a2, 4, 44, green
a3, 5, 49, yellow
I want to make a separate data frame for each feature (i.e. height, weight, etc.) with information from each csv. The output would look like this for each feature:
height.df:
ID, Day1, Day2
a1, 3, 4
a2, 3, 4
a3, 4, 5
I have tried to use merge(), but that requires that I input only two columns at a time. I'm also not sure how to use the filename to label the column.
I would consider just putting all the data into a list and rbinding the data together (if the columns are of the same types).
Example:
## Assume you have read in files and saved them as `data.frame`s named
## "day1", "day2", and so on....
temp <- mget(ls(pattern = "day\\d+"))
long <- do.call(rbind, lapply(names(temp), function(x) cbind(Day = x, temp[[x]])))
From there, you can do transformations quite easily. For instance, make the entire dataset into a "wide" dataset:
reshape(long, direction = "wide", idvar = "ID", timevar = "Day")
# ID height.day1 weight.day1 color.day1 height.day2 weight.day2 color.day2
# 1 a1 3 45 blue 4 47 green
# 2 a2 3 44 green 4 44 green
# 3 a3 4 48 blue 5 49 yellow
Or, just a specific variable:
library(data.table)
dcast.data.table(as.data.table(long), ID ~ Day, value.var = "height")
# ID day1 day2
# 1: a1 3 4
# 2: a2 3 4
# 3: a3 4 5
If you really want to make separate data frames, here's one way you could do it:
Day1.csv <- read.table(header=T, sep=",", text="
ID, height, weight, color
a1, 3, 45, blue
a2, 3, 44, green
a3, 4, 48, blue")
Day2.csv <- read.table(header=T, sep=",", text="
ID, height, weight, color
a1, 4, 47, green
a2, 4, 44, green
a3, 5, 49, yellow")
library(tidyr)
l <- mget(ls(pattern = "Day\\d+\\.csv"))
df <- do.call(rbind, lapply(seq(l), function(x) transform(l[[x]], Day = paste0("Day", gsub("\\D", "", names(l)[x])))))
df <- gather(df, variable, value, -ID, -Day)
list2env(
setNames(lapply(unique(as.character(df$variable)), function(x) {
spread(df[df$variable == x, -which(names(df) == "variable")], Day, value, fill = 0)
}), paste0(unique(as.character(df$variable)), ".df")), globalenv())
weight.df
# ID Day1 Day2
# 1 a1 45 47
# 2 a2 44 44
# 3 a3 48 49
height.df
# ID Day1 Day2
# 1 a1 3 4
# 2 a2 3 4
# 3 a3 4 5
color.df
# ID Day1 Day2
# 1 a1 blue green
# 2 a2 green green
# 3 a3 blue yellow
This seems to work with two dataframes.
file_names <- list.files('C:/mydirectory/mycsvs', full.names = T)
file_list <- lapply(file_names, read.csv, stringsAsFactors = F)
So put your csvs in a directory, read them into a list.
library(dplyr)
new_dataframes = list()
for(i in 2:ncol(file_list[[1]])){
new_list <- list()
for(j in 1:length(file_list)){
new_list[[j]] <- file_list[[j]][, c(1, i)]
}
joined_df <- new_list[[1]]
for(j in 2:length(new_list)){
joined_df <- inner_join(joined_df, new_list[[j]], by = 'ID')
}
for(j in 2:ncol(joined_df)){
colnames(joined_df)[j] <- paste0('day ', j - 1)
}
feature_name <- colnames(new_list[[1]])[2]
new_dataframes[[feature_name]] <- joined_df
}
new_dataframes
$height
ID day 1 day 2
1 a1 3 4
2 a2 3 4
3 a3 4 5
$weight
ID day 1 day 2
1 a1 45 47
2 a2 44 44
3 a3 48 49
$color
ID day 1 day 2
1 a1 blue green
2 a2 green green
3 a3 blue yellow
This assumes that you have sequential days (e.g. you have day 1 through day n with no missing days). If that's not the case, it's not that difficult to extract the day from the file name, assuming it's in the name. This also assumes each dataframe is more or less identical when it comes to number/order of columns. If that's not the case, this won't work. It also does an inner_join, so you're only going to get records that have matching IDs.
And if anyone has an idea for getting rid of those loops, I'd love to hear it, especially for that join step. That can probably get relatively slow depending on the size of the data.
Anyway, you end up with a list of dataframes, where the name of the list element is the feature. Depending on how much data you have, this list could get very large.
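On getting rid of the loops: one possible approach, sketched here in base R, is to build the per-day pieces with Map and fold them together with Reduce and merge. The sketch assumes the files have already been read into a named list (called day_list here, a made-up name) whose names are the day labels; like inner_join, merge keeps only IDs present in every file.

```r
# Stand-in for the data read from the CSV files, named by day
day_list <- list(
  Day1 = data.frame(ID = c("a1", "a2", "a3"), height = c(3, 3, 4),
                    weight = c(45, 44, 48), color = c("blue", "green", "blue")),
  Day2 = data.frame(ID = c("a1", "a2", "a3"), height = c(4, 4, 5),
                    weight = c(47, 44, 49), color = c("green", "green", "yellow"))
)

features <- setdiff(names(day_list[[1]]), "ID")

by_feature <- lapply(setNames(features, features), function(f) {
  # For each day keep ID plus this feature, renaming the feature to the day label
  pieces <- Map(function(d, day) setNames(d[c("ID", f)], c("ID", day)),
                day_list, names(day_list))
  # Join all days on ID, two data frames at a time
  Reduce(function(a, b) merge(a, b, by = "ID"), pieces)
})

by_feature$height
```

Because the column labels come from the list names rather than from a counter, missing days do not shift the labels.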
