Reshape and aggregate a data.table - r

I asked a very similar question, and because I haven't quite gotten a handle on tidyr or reshape I have to ask another one. I have a data.table containing repeated id values (see below):
id Product NI
1 Direct Auto 15
2 Direct Auto 15
3 Direct Auto 15
4 Direct Auto 15
5 Direct Auto 15
6 Direct Auto 15
6 Mortgage 50
9 Direct Auto 15
10 Direct Auto 15
11 Direct Auto 15
12 Direct Auto 15
13 Direct Auto 15
14 Direct Auto 15
15 Direct Auto 15
16 Direct Auto 15
1 Mortgage 50
5 Personal 110
19 Direct Auto 15
20 Direct Auto 15
1 Direct Auto 15
I would like each id aggregated to one row, the Product column 'spread' so that its values become variables, another variable containing the count of each Product by id, and the NI summed for each product group by id. See the example below:
id DirectAuto DA_NI Mortgage Mortgage_NI Personal P_NI
1 2 30 1 50 NA NA
2 1 15 NA NA NA NA
3 1 15 NA NA NA NA
4 1 15 NA NA NA NA
5 1 15 NA NA 1 110
6 1 15 1 50 NA NA
9 1 15 NA NA NA NA
10 1 15 NA NA NA NA
11 1 15 NA NA NA NA
12 1 15 NA NA NA NA
13 1 15 NA NA NA NA
14 1 15 NA NA NA NA
15 1 15 NA NA NA NA
16 1 15 NA NA NA NA
19 1 15 NA NA NA NA
20 1 15 NA NA NA NA
For example, id 1 has 2 Direct Auto rows, so his DA_NI is 30, and he has 1 Mortgage, so his Mortgage_NI is 50.
So, basically, make a 'wider' data.table. I'm still reading up on and practicing tidyr and reshape, but in the meantime maybe someone can help.
Here is some of my starting code:
df[, .(tot = .N, NI = sum(NI)), by = c("id","Product")]
Afterwards, using some tidyr & reshape commands I can't seem to get the final output I want.

data.table v1.9.5 has nicer features for melting and casting. Using dcast from the devel version:
require(data.table) # v1.9.5
dcast(dt, id ~ Product, fun.agg = list(sum, length), value.var="NI", fill=NA)
I think this is what you're looking for. You can check out the new HTML vignettes here.
Rename the columns to your liking.
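For what it's worth, the multi-function form of dcast also works on current CRAN releases of data.table (the argument's full name is fun.aggregate; fun.agg relies on partial matching). A minimal sketch on a hypothetical cut-down dt, not the asker's full data:

```r
library(data.table)

# Hypothetical cut-down version of the question's data
dt <- data.table(id      = c(1, 1, 1, 2),
                 Product = c("Direct Auto", "Direct Auto", "Mortgage", "Direct Auto"),
                 NI      = c(15, 15, 50, 15))

# One count column and one sum column per Product level;
# fill = NA leaves missing id/Product combinations as NA
wide <- dcast(dt, id ~ Product,
              fun.aggregate = list(length, sum),
              value.var = "NI", fill = NA)
```

The columns come out named along the lines of NI_length_Direct Auto and NI_sum_Direct Auto, hence the advice to rename them afterwards.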

It's a little tricky to do this. It can be done using tidyr and dplyr, though it goes against Hadley Wickham's tidy data principles.
dat %>%
  group_by(id, Product) %>%
  summarise(NI = sum(NI), n = n()) %>%
  gather(variable, value, n, NI) %>%
  mutate(
    col_name = ifelse(variable == "n",
                      as.character(Product),
                      paste(Product, variable, sep = "_"))
  ) %>%
  select(-c(Product, variable)) %>%
  spread(col_name, value)
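Since this answer was written, gather() and spread() have been superseded by pivot_longer() and pivot_wider(), and pivot_wider() can spread several value columns at once. A sketch of the same idea on a hypothetical cut-down dat (the resulting names are e.g. n_Direct Auto / NI_Direct Auto rather than the exact names in the question):

```r
library(dplyr)
library(tidyr)  # >= 1.0.0 for pivot_wider()

# Hypothetical cut-down version of the question's data
dat <- data.frame(id      = c(1, 1, 1, 2),
                  Product = c("Direct Auto", "Direct Auto", "Mortgage", "Direct Auto"),
                  NI      = c(15, 15, 50, 15))

wide <- dat %>%
  group_by(id, Product) %>%
  summarise(n = n(), NI = sum(NI), .groups = "drop") %>%
  pivot_wider(names_from = Product, values_from = c(n, NI))
```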

Related

How to keep the NAs while we exclude them from the analysis

I have a large column with NAs. I want to rank the time values as shown below, keeping the NAs while excluding them from the ranking:
df<-read.table(text="time
40
30
50
NA
60
NA
20
", header=True)
I want to get the following table:
time Rank
40 3
30 4
50 2
NA NA
60 1
NA NA
20 5
I have used the following code:
df$Rank<--df$time,ties.method="mim")
#fixed data
df<-read.table(text="time
40
30
50
NA
60
NA
20
", header=TRUE)
You can do something like
nonNaIndices <- !is.na(df$time)
df$Rank <- NA
df$Rank[nonNaIndices] <- rank(-df$time[nonNaIndices], ties.method = "min")
resulting in
> df
time Rank
1 40 3
2 30 4
3 50 2
4 NA NA
5 60 1
6 NA NA
7 20 5
Note: Please make sure to check your question for missing function calls before submitting it. In your case it could be guessed from the context.
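As a side note, base rank() can keep the NAs in place by itself via na.last = "keep", and a minus sign gives the descending order the desired table shows; a one-line sketch using the fixed df:

```r
df <- read.table(text = "time
40
30
50
NA
60
NA
20
", header = TRUE)

# na.last = "keep" returns NA ranks for NA inputs; -time ranks descending
df$Rank <- rank(-df$time, ties.method = "min", na.last = "keep")
```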
You can use dense_rank from dplyr -
library(dplyr)
df %>% mutate(Rank = dense_rank(-time))
# time Rank
#1 40 3
#2 30 4
#3 50 2
#4 NA NA
#5 60 1
#6 NA NA
#7 20 5

Wide format: a function to calculate row means for specific batches of columns, then scale up for multiple batches

This is a followup question to a previous post of mine about building a function for calculating row means.
I want to use a function from the apply family to iterate over my dataset and each time compute the row mean (which is what the function does) for a group of columns I specify. Unfortunately, I'm missing something critical in the way I should tweak apply(), because I get an error that I can't troubleshoot.
Example Data
capital_cities_df <-
  data.frame("europe_paris"   = 1:10,
             "europe_london"  = 11:20,
             "europe_rome"    = 21:30,
             "asia_bangkok"   = 31:40,
             "asia_tokyo"     = 41:50,
             "asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(capital_cities_df,
  function(cc) cc[sample(c(TRUE, NA),
                         prob = c(0.70, 0.30),
                         size = length(cc),
                         replace = TRUE)]))
> capital_cities_df
europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 NA NA NA 41 NA
2 NA 12 22 NA 42 52
3 3 NA 23 33 43 NA
4 NA 14 NA NA NA NA
5 NA 15 25 35 45 NA
6 6 NA NA 36 NA 56
7 NA 17 NA NA NA 57
8 NA 18 NA 38 48 NA
9 NA 19 NA 39 49 NA
10 10 NA 30 40 NA 60
Custom Function
library(dplyr)
library(rlang)
continent_mean <- function(df, continent) {
  df %>%
    select(starts_with(continent)) %>%
    dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
## works for a single case:
continent_mean(capital_cities_df, "europe")
europe_paris europe_london europe_rome europe
1 1 NA 21 11
2 2 12 22 12
3 3 NA 23 13
4 4 14 NA 9
5 NA 15 25 20
6 6 16 26 16
7 NA 17 NA 17
8 NA 18 NA 18
9 NA 19 NA 19
10 10 20 30 20
Trying to apply the function over the data, unsuccessfully
apply(
capital_cities_df,
MARGIN = 2,
FUN = continent_mean(capital_cities_df, continent = "europe")
)
Error in match.fun(FUN) :
'continent_mean(capital_cities_df, continent = "europe")' is not a function, character or symbol
Any other combination of the arguments in apply() didn't work either, nor did sapply. This unsuccessful attempt of using apply is only for one type of columns I wish to get the mean for ("europe"). However, my ultimate goal is to be able to pass c("europe", "asia", etc.) with apply, so I could get the custom function to create row means columns for all groups of columns I specify, in one hit.
What is wrong with my code?
Thanks!
EDIT 19-AUG-2019
I was trying the solution suggested by A. Suliman (see below). It did work for the example data I posted here, but not when trying to scale it up to my real dataset, where I need to subset additional columns (rather than the "continent" batch only). More specifically, in my real data I have an ID column which I want to get outputted along the other data, when I apply my custom-made function.
Example data including "ID" column
capital_cities_df <- data.frame(
  "europe_paris"   = 1:10,
  "europe_london"  = 11:20,
  "europe_rome"    = 21:30,
  "asia_bangkok"   = 31:40,
  "asia_tokyo"     = 41:50,
  "asia_kathmandu" = 51:60)
set.seed(123)
capital_cities_df <- as.data.frame(lapply(capital_cities_df,
  function(cc) cc[sample(c(TRUE, NA),
                         prob = c(0.70, 0.30),
                         size = length(cc),
                         replace = TRUE)]))
id <- 1:10
capital_cities_df <- cbind(id, capital_cities_df)
> capital_cities_df
id europe_paris europe_london europe_rome asia_bangkok asia_tokyo asia_kathmandu
1 1 1 NA NA NA 41 NA
2 2 NA 12 22 NA 42 52
3 3 3 NA 23 33 43 NA
4 4 NA 14 NA NA NA NA
5 5 NA 15 25 35 45 NA
6 6 6 NA NA 36 NA 56
7 7 NA 17 NA NA NA 57
8 8 NA 18 NA 38 48 NA
9 9 NA 19 NA 39 49 NA
10 10 10 NA 30 40 NA 60
My function (edited to select id as well)
continent_mean <- function(df, continent) {
  df %>%
    select(., id, starts_with(continent)) %>%
    dplyr::mutate(!!quo_name(continent) := rowMeans(., na.rm = TRUE))
}
> continent_mean(capital_cities_df, "europe") ## works in a single run
id europe_paris europe_london europe_rome europe
1 1 1 NA NA 1.000000
2 2 NA 12 22 12.000000
3 3 3 NA 23 9.666667
4 4 NA 14 NA 9.000000
5 5 NA 15 25 15.000000
6 6 6 NA NA 6.000000
7 7 NA 17 NA 12.000000
8 8 NA 18 NA 13.000000
9 9 NA 19 NA 14.000000
10 10 10 NA 30 16.666667
Trying to apply the function beyond the single use (based on A. Suliman's method) -- unsuccessfully
continents <- c("europe", "asia")
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))
## or:
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))
In either case I get a variety of error messages:
Error in inds_combine(.vars, ind_list) : Position must be between 0
and n
At other times:
Error: invalid column index : NA for variable: 'NA' = 'NA'
All I wanted was a simple function to let me calculate row means per specification of which columns to run over, but this gets nasty for some reason. Even though I'm eager to figure out what's wrong with my code, if anybody has a better overarching solution for the entire process I'd be thankful too.
Thanks!
Use lapply to loop through continents then use grep to select columns with the current continent
continents <- c("europe", "asia")
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, grep(x, names(capital_cities_df))], continent=x))
#To a dataframe not a list
do.call(cbind, lst)
Using map_dfc from purrr we can get the result in one step
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, grep(.x, names(capital_cities_df))], continent=.x))
Update:
#grep will return column positions when they match with "europe" or "asia", e.g
> grep("europe", names(capital_cities_df))
[1] 2 3 4
#If we need the column names then we add value=TRUE to grep
> grep("europe", names(capital_cities_df), value = TRUE)
[1] "europe_paris" "europe_london" "europe_rome"
So to add a new column we can just use the c() function and call the function as usual
#NOTE: Here I'm using the old function without select
lst <- lapply(continents, function(x) continent_mean(df=capital_cities_df[, c('id',grep(x, names(capital_cities_df), value = TRUE))], continent=x))
do.call(cbind, lst)
id europe_paris europe_london europe_rome europe id asia_bangkok asia_tokyo asia_kathmandu asia
1 1 1 NA NA 1.00000 1 NA 41 51 31.00000
2 2 NA 12 22 12.00000 2 NA 42 52 32.00000
3 3 3 13 23 10.50000 3 33 43 NA 26.33333
4 4 NA 14 NA 9.00000 4 NA 44 54 34.00000
5 5 NA 15 25 15.00000 5 35 45 55 35.00000
6 6 6 NA NA 6.00000 6 36 46 56 36.00000
7 7 7 17 27 14.50000 7 NA 47 57 37.00000
8 8 NA 18 28 18.00000 8 38 48 NA 31.33333
9 9 9 19 29 16.50000 9 39 49 NA 32.33333
10 10 10 NA 30 16.66667 10 40 NA 60 36.66667
#We have one problem, id column gets duplicated, map_dfc with select will solve this issue
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df[, c('id',grep(.x, names(capital_cities_df), value = TRUE))], continent=.x)) %>%
#Don't select any column name ends with id followed by one digit
select(-matches('id\\d'))
If you'd like to use the new function with select then just pass capital_cities_df without grep, e.g using map_dfc
purrr::map_dfc(continents, ~continent_mean(df=capital_cities_df, continent=.x)) %>%
select(-matches('id\\d'))
Correction: in continent_mean
continent_mean <- function(df, continent) {
  df %>%
    select(., id, starts_with(continent)) %>%
    # Exclude id from the rowMeans calculation
    dplyr::mutate(!!quo_name(continent) := rowMeans(.[grep(continent, names(.))], na.rm = TRUE))
}
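An alternative sketch that sidesteps the duplicated id column entirely: compute one rowMeans() per continent with sapply(), which binds the results into a matrix, and attach id once at the end. This assumes a frame shaped like capital_cities_df above (shown here on a hypothetical cut-down version); note that rows which are NA for every column of a continent come out as NaN rather than NA:

```r
# Hypothetical cut-down version of capital_cities_df with an id column
capital_cities_df <- data.frame(
  id            = 1:3,
  europe_paris  = c(1, NA, 3),
  europe_london = c(NA, 12, NA),
  asia_bangkok  = c(NA, NA, 33),
  asia_tokyo    = c(41, 42, 43))

continents <- c("europe", "asia")

# One rowMeans() per continent; sapply() binds the columns into a matrix
means <- sapply(continents, function(cont)
  rowMeans(capital_cities_df[grep(cont, names(capital_cities_df))],
           na.rm = TRUE))

result <- cbind(capital_cities_df["id"], means)
```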

copy data frame to the other matching row number

I have a dataframe A like below.
Notice that the first column is the row name, in random order.
ID
5 10
3 10
1 10
Then I have another 5 x 1 data frame B filled with NAs. I am trying to copy A into B, matching on the row names of A. I want to get a data frame like below.
ID
1 10
2 NA
3 10
4 NA
5 10
What you are trying to do is potentially dangerous. If you are 100% sure that the row names contain identifiers that match between the two data frames, here's the code.
library(tidyverse)
# Generate a data frame that looks like yours (you don't need this)
df <- data.frame(ID = c(10, NA, 10, NA, 10))
# Assign row names to a new column on the df
df$names <- row.names(df)
# Keep only the complete cases; this is what your data looks like
df <- df[complete.cases(df), ]
# Make a second df
df2 <- data.frame(names = as.character(1:20))
# Join by names (what are other possible columns to join by?)
left_join(df2, df, by = "names")
This will produce
names ID
1 1 10
2 2 NA
3 3 10
4 4 NA
5 5 10
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
11 11 NA
12 12 NA
13 13 NA
14 14 NA
15 15 NA
16 16 NA
17 17 NA
18 18 NA
19 19 NA
20 20 NA
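For reference, base R can do the same fill with match() on the row names, without a join; a sketch assuming A and B are shaped like the frames in the question:

```r
# A: three rows with out-of-order row names, as in the question
A <- data.frame(ID = c(10, 10, 10), row.names = c("5", "3", "1"))
# B: five rows of NA to be filled
B <- data.frame(ID = rep(NA_real_, 5), row.names = as.character(1:5))

# Copy A's values into the B rows whose row names match
B$ID[match(rownames(A), rownames(B))] <- A$ID
```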

Overwriting a row with a matched ID value in the same dataframe

I have an odd issue I just haven't been able to find a good solution to. Basically, our system outputs scores for people, usually only the first time they are in the system. However, sometimes it enters the scores the second time they are in the system, and sometimes it fills all rows like it should. Correcting the database would be ideal, but that isn't going to happen (thanks, management). We also can't just get rid of duplicate ID values, as they are duplicated for a reason. What I need to do is copy the scores into the fields that have NA for all matching ID values. So, here is a data example:
ID VAR1 VAR2 VAR3 VAR4 VAR5
1 16 15 14 15 46
1 NA NA NA NA NA
2 15 12 11 14 12
3 14 12 12 15 22
3 14 12 12 15 22
4 NA NA NA NA NA
4 11 04 12 33 12
6 NA NA NA NA NA
The output would look like
ID VAR1 VAR2 VAR3 VAR4 VAR5
1 16 15 14 15 46
1 16 15 14 15 46
2 15 12 11 14 12
3 14 12 12 15 22
3 14 12 12 15 22
4 11 04 12 33 12
4 11 04 12 33 12
6 NA NA NA NA NA
I managed to get something working for this problem in order to move it off my desk, but this problem is going to be recurring and I want a better solution. My solution is:
df_2 <- list()
for (i in df$ID) {
  df_2[[as.character(i)]] <- filter(df, ID == i) %>%
    mutate(VAR1 = mean(VAR1, na.rm = TRUE),
           VAR2 = mean(VAR2, na.rm = TRUE),
           VAR3 = mean(VAR3, na.rm = TRUE),
           VAR4 = mean(VAR4, na.rm = TRUE),
           VAR5 = mean(VAR5, na.rm = TRUE))
}
# Then we bind this together as a dataframe
df_replaced <- bind_rows(df_2)
# Remove the list object as it's huge
rm(df_2)
This works, but it takes about a thousand years and creates a temporary list (df_2) of around 4 gigs, which is why I need to remove it as soon as possible: it pretty much brings my system to a complete halt. I feel like there is something with match, but I'm not really sure how to intelligently select the data row to copy over the NA row.
EDIT: Fixed table formatting.
Here is a base R method using is.na and match to select the indices to use as fillers and to be filled.
df[is.na(df$VAR1), -1] <- df[match(df$ID[is.na(df$VAR1)],
df$ID[ifelse(!is.na(df$VAR1), TRUE, NA)]), -1]
which returns
df
ID VAR1 VAR2 VAR3 VAR4 VAR5
1 1 16 15 14 15 46
2 1 16 15 14 15 46
3 2 15 12 11 14 12
4 3 14 12 12 15 22
5 3 14 12 12 15 22
6 4 11 4 12 33 12
7 4 11 4 12 33 12
8 6 NA NA NA NA NA
The trick here is to use ifelse to return a table (the second argument to match) that is the same length as the number of rows in the data.frame.
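A tidyverse alternative worth knowing, in case it fits better than the match() trick: tidyr::fill() with .direction = "downup" copies the nearest non-NA value within each ID group and leaves all-NA groups (like ID 6) untouched. A sketch on a hypothetical cut-down version of the data:

```r
library(dplyr)
library(tidyr)

# Hypothetical cut-down version of the question's data
df <- data.frame(ID   = c(1, 1, 2, 3, 3, 4, 4, 6),
                 VAR1 = c(16, NA, 15, 14, 14, NA, 11, NA),
                 VAR2 = c(15, NA, 12, 12, 12, NA, 4, NA))

# fill() respects the grouping, so values never leak across IDs
filled <- df %>%
  group_by(ID) %>%
  fill(VAR1, VAR2, .direction = "downup") %>%
  ungroup()
```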

Add an integer to every element of data frame

Say I have a data frame as follows
rsi5 rsi10
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 44.96650 NA
7 39.68831 NA
8 28.35625 NA
9 37.77910 NA
10 53.54822 NA
11 52.05308 46.01867
12 80.44368 66.09973
13 60.88418 56.04507
14 53.59851 52.10633
15 46.45874 48.23648
I wish to simply add 1 (i.e. 9 becomes 10) to each non-NA element of this data frame. There is probably a very simple solution to this, but simple arithmetic on data frames does not seem to work in R, giving very strange results.
Just use + 1 as you would expect. Below is a mock example, as it wasn't worth copying your data for this.
Step One: Create a data.frame
R> df <- data.frame(A=c(NA, 1, 2, 3), B=c(NA, NA, 12, 13))
R> df
A B
1 NA NA
2 1 NA
3 2 12
4 3 13
R>
Step Two: Add one
R> df + 1
A B
1 NA NA
2 2 NA
3 3 13
4 4 14
R>
