How can I transform data X to Y as in
X = data.frame(
ID = c(1,1,1,2,2),
NAME = c("MIKE","MIKE","MIKE","LUCY","LUCY"),
SEX = c("MALE","MALE","MALE","FEMALE","FEMALE"),
TEST = c(1,2,3,1,2),
SCORE = c(70,80,90,65,75)
)
Y = data.frame(
ID = c(1,2),
NAME = c("MIKE","LUCY"),
SEX = c("MALE","FEMALE"),
TEST_1 =c(70,65),
TEST_2 =c(80,75),
TEST_3 =c(90,NA)
)
The dcast function in reshape2 seems to work, but it cannot include the other columns in the data, such as ID, NAME and SEX in the example above.
Assuming all other columns are consistent within an ID (for example, ID 1 is always Mike, who is always male), how can we do it?
According to the documentation (?reshape2::dcast), dcast() allows for ... in the formula:
"..." represents all other variables not used in the formula ...
This is true for both the reshape2 and the data.table packages which both support dcast().
So, you can write:
reshape2::dcast(X, ... ~ TEST, value.var = "SCORE")
# ID NAME SEX 1 2 3
#1 1 MIKE MALE 70 80 90
#2 2 LUCY FEMALE 65 75 NA
However, if the OP insists that the column names should be TEST_1, TEST_2, etc., the TEST column needs to be modified before reshaping. Here, data.table is used:
library(data.table)
dcast(setDT(X)[, TEST := paste0("TEST_", TEST)], ... ~ TEST, value.var = "SCORE")
# ID NAME SEX TEST_1 TEST_2 TEST_3
#1: 1 MIKE MALE 70 80 90
#2: 2 LUCY FEMALE 65 75 NA
which is in line with the expected answer given as data.frame Y.
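For comparison, here is a minimal sketch of the same reshape with tidyr::pivot_wider(), which is not part of the answer above. It assumes tidyr 1.0.0 or later is installed. The names_prefix argument builds the TEST_ column names directly, and every column not named in names_from or values_from is kept as an identifier.

```r
library(tidyr)

X <- data.frame(
  ID = c(1, 1, 1, 2, 2),
  NAME = c("MIKE", "MIKE", "MIKE", "LUCY", "LUCY"),
  SEX = c("MALE", "MALE", "MALE", "FEMALE", "FEMALE"),
  TEST = c(1, 2, 3, 1, 2),
  SCORE = c(70, 80, 90, 65, 75)
)

# ID, NAME and SEX are kept as identifier columns because they are
# not used in names_from or values_from
Y2 <- pivot_wider(X, names_from = TEST, values_from = SCORE,
                  names_prefix = "TEST_")
Y2
```

The result is a tibble rather than a data.frame; as.data.frame(Y2) converts it if needed.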
I have four data frames:
df01 <- data.frame(ID = c("001","002","003","004"),
Name = c("Ben","Jennifer","Mark","Brad"),
LastName = c("Affleck","Lopez","Anthony","Pitt"))
df02 <- data.frame(ID = c("001","002"),
Age = c(37,41))
df03 <- data.frame(ID = c("003"),
Age = c(28))
df04 <- data.frame(ID = c("004"),
Age = c(48))
I am trying to join using dplyr package with the function left_join like this:
df <- df01 %>%
left_join(df02, by = "ID") %>%
left_join(df03, by = "ID") %>%
left_join(df04, by = "ID")
And my current outcome is
> df
ID Name LastName Age.x Age.y Age
1 001 Ben Affleck 37 NA NA
2 002 Jennifer Lopez 41 NA NA
3 003 Mark Anthony NA 28 NA
4 004 Brad Pitt NA NA 48
But my expected outcome would be:
> df
ID Name LastName Age
1 001 Ben Affleck 37
2 002 Jennifer Lopez 41
3 003 Mark Anthony 28
4 004 Brad Pitt 48
I should say that this is a very simplified version of the problem. One solution would be to bind the rows first and then apply left_join, like this:
dfx <- bind_rows(df02,df03,df04)
df <- df01 %>%
left_join(dfx, by = "ID")
but the real data is larger than memory, and that solution fails with "Error: cannot allocate vector of size ..."
Thank you very much for your help.
Here's a use of Reduce (or you can use purrr::reduce, effectively the same thing):
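library(dplyr)
fun <- function(a, b) {
  # suffix = c("", ".y") keeps the name of the Age column already present
  out <- left_join(a, b, by = "ID", suffix = c("", ".y"))
  if (all(c("Age", "Age.y") %in% names(out))) {
    # prefer the newly joined value, fall back to the one already present
    out <- mutate(out, Age = coalesce(Age.y, Age)) %>%
      select(-Age.y)
  }
  out
}
Reduce(fun, list(df01, df02, df03, df04))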
# ID Name LastName Age
# 1 001 Ben Affleck 37
# 2 002 Jennifer Lopez 41
# 3 003 Mark Anthony 28
# 4 004 Brad Pitt 48
Quick walk-through:
Reduce calls the function (fun here) on the first two elements of the list provided; it then calls with that return value and the 3rd element in the list; then calls with that return value and the 4th; until the list is exhausted
coalesce is a great function that returns the first non-NA value of the values provided; and it's vectorized, which is great; try coalesce(c(1,NA,3), c(22,33,44)) and get c(1,33,3).
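To make the walk-through concrete, here is a small sketch of both building blocks on toy values (the numbers are made up for illustration; only dplyr is assumed to be installed):

```r
library(dplyr)

# Reduce folds a list from left to right: ((1 + 2) + 3) + 4
total <- Reduce(`+`, list(1, 2, 3, 4))

# accumulate = TRUE also keeps every intermediate result
steps <- Reduce(`+`, list(1, 2, 3, 4), accumulate = TRUE)

# coalesce takes, position by position, the first non-NA value
filled <- coalesce(c(1, NA, 3), c(22, 33, 44))
```

In the answer's fun, the accumulated data frame plays the role of the running total, and each new table is folded in with a left join followed by coalesce.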
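Since each ID has only one non-missing value across the Age columns, you could sum them row by row after the left joins. Note that c_across() only works per row on a rowwise data frame, so rowwise() is needed before the mutate.
df <- df01 %>%
  left_join(df02, by = "ID") %>%
  left_join(df03, by = "ID") %>%
  left_join(df04, by = "ID") %>%
  rowwise() %>%
  mutate(Age = sum(c_across(all_of(stringr::str_subset(colnames(.), "Age"))), na.rm = TRUE)) %>%
  ungroup() %>%
  select(-all_of(stringr::str_subset(colnames(.), "^Age\\.")))
With the stringr package you can pick all columns that have "Age" in their name. Then you do the same to remove the suffixed columns that start with "Age." (Age.x and Age.y). Be aware that an ID with no age in any table ends up as 0 rather than NA.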
Generally, join operations are used to combine tables with different column sets. Here df02, df03 and df04 have all the same columns and seem to require row binding, rather than joining.
I would do it like this:
> bind_rows(df02, df03, df04) %>% left_join(df01, ., by = "ID")
ID Name LastName Age
1 001 Ben Affleck 37
2 002 Jennifer Lopez 41
3 003 Mark Anthony 28
4 004 Brad Pitt 48
In case you are not sure that the IDs in those tables are unique, you need to decide what to do with duplicates. %>% group_by(ID) %>% summarize(Age = first(Age)) before the left join would select the first age among duplicate IDs, if any.
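A minimal sketch of that deduplication step, using cut-down versions of the tables from the question. The duplicated row in df03 is made up for illustration:

```r
library(dplyr)

df01 <- data.frame(ID = c("001", "002"), Name = c("Ben", "Jennifer"))
df02 <- data.frame(ID = c("001", "002"), Age = c(37, 41))
df03 <- data.frame(ID = "001", Age = 99)  # hypothetical duplicate of ID 001

# stack the age tables, then keep the first age seen for each ID
ages <- bind_rows(df02, df03) %>%
  group_by(ID) %>%
  summarize(Age = first(Age))

res <- left_join(df01, ages, by = "ID")
res
```

Without the summarize step, the left join would return two rows for ID 001.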
I've got a data.frame dt with some duplicate keys and missing data, i.e.
Name Height Weight Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA
In this case the key is the name, and I would like to apply to each column a function like
f <- function(x){
x <- x[!is.na(x)]
x <- x[1]
return(x)
}
while aggregating by the key (i.e., the "Name" column), so as to obtain as a result
Name Height Weight Age
Alice 180 70 35
Bob NA 80 27
Charles 170 75 NA
I tried
dt_agg <- aggregate(. ~ Name,
data = dt,
FUN = f)
and I got some errors, then I tried the following
dt_agg_1 <- aggregate(Height ~ Name,
data = dt,
FUN = f)
dt_agg_2 <- aggregate(Weight ~ Name,
data = dt,
FUN = f)
and this time it worked.
Since I have 50 columns, this second approach is quite cumbersome for me. Is there a way to fix the first approach?
Thanks for help!
You were very close with the aggregate function; you only need to change how aggregate handles NA (from na.omit to na.pass). With the formula interface, aggregate removes every row that contains an NA before it aggregates, instead of removing NAs column by column as it iterates over the columns. Since every row of your example data frame has an NA, you end up with a 0-row data frame (which is the error I got when running your code). I tested this by removing all but one NA, and your code then works as-is. So we set na.action = na.pass to pass the NAs through.
dt_agg <- aggregate(. ~ Name,
data = dt,
FUN = f, na.action = "na.pass")
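The following sketch rebuilds the question's data to show both halves of that explanation. The data frame is typed in by hand from the table in the question and f is the asker's function written in one line; only base R is used.

```r
dt <- data.frame(
  Name   = c("Alice", "Bob", "Alice", "Charles"),
  Height = c(180, NA, NA, 170),
  Weight = c(NA, 80, 70, 75),
  Age    = c(35, 27, NA, NA)
)

# every row has at least one NA, so the default na.omit leaves nothing
rows_left <- nrow(na.omit(dt))

# first non-NA value of a vector (NA if there is none)
f <- function(x) x[!is.na(x)][1]

# na.pass keeps all rows and lets f deal with the NAs per column
dt_agg <- aggregate(. ~ Name, data = dt, FUN = f, na.action = na.pass)
dt_agg
```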
Original answer:
dt_agg <- aggregate(dt[, -1],
by = list(dt$Name),
FUN = f)
dt_agg
# Group.1 Height Weight Age
# 1 Alice 180 70 35
# 2 Bob NA 80 27
# 3 Charles 170 75 NA
You can do this with dplyr:
library(dplyr)
df %>%
group_by(Name) %>%
summarize_all(~ sort(.)[1]) # funs() is deprecated since dplyr 0.8.0; a formula does the same
Result:
# A tibble: 3 x 4
Name Height Weight Age
<fctr> <int> <int> <int>
1 Alice 180 70 35
2 Bob NA 80 27
3 Charles 170 75 NA
Data:
df = read.table(text = "Name Height Weight Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA", header = TRUE)
Here is an option with data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) head(sort(x), 1)), Name]
# Name Height Weight Age
#1: Alice 180 70 35
#2: Bob NA 80 27
#3: Charles 170 75 NA
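One thing to keep in mind with the dplyr and data.table answers above: sort() drops NAs and orders the rest, so taking its first element returns the smallest non-NA value, not the first one in row order as the asker's f does. For the example data the two coincide because each group has at most one non-NA value per column. A small sketch with made-up numbers shows the difference:

```r
x <- c(NA, 30, 10)

smallest <- sort(x)[1]       # sort() drops NAs, so this is the minimum
first_ok <- x[!is.na(x)][1]  # first non-NA value in the original order
```

Here smallest is 10 while first_ok is 30.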
Simply add na.action = na.pass to the aggregate() call:
aggdf <- aggregate(.~Name, data=df, FUN=f, na.action=na.pass)
# Name Height Weight Age
# 1 Alice 180 70 35
# 2 Bob NA 80 27
# 3 Charles 170 75 NA
If you change your function so that it explicitly returns NA when all values are NA:
f <- function(x) {
  x <- x[!is.na(x)]
  # first non-NA value, or NA if nothing is left
  if (length(x) == 0) NA else x[1]
}
You can use dplyr to aggregate:
library(dplyr)
dt %>% group_by(Name) %>% summarise_all(f)
This returns:
# A tibble: 3 x 4
Name Height Weight Age
<fctr> <dbl> <dbl> <dbl>
1 Alice 180 70 35
2 Bob NA 80 27
3 Charles 170 75 NA
Suppose that I apply a treatment to some column values of a data frame like this:
id animal weight height ...
1 dog 23.0
2 cat NA
3 duck 1.2
4 fairy 0.2
5 snake BAD
df <- data.frame(id = seq(1:5),
animal = c("dog", "cat", "duck", "fairy", "snake"),
weight = c("23", NA, "1.2", "0.2", "BAD"))
Suppose that the treatment has to be done in a separate table and yields the following data frame, which is a subset of the original:
id animal weight
2 cat 2.2
5 snake 1.3
sub_df <- data.frame(id = c(2, 5),
animal = c("cat", "snake"),
weight = c("2.2", "1.3"))
Now I want to put all together again, so I use an operation like this:
> df %>%
anti_join(sub_df, by = c("id", "animal")) %>%
bind_rows(sub_df)
id animal weight
4 fairy 0.2
1 dog 23.0
3 duck 1.2
2 cat 2.2
5 snake 1.3
Is there a way to do this directly with join operations?
And if the subset contains only the key columns and the treated variable (id, animal, weight) rather than all the variables of the original data frame (id, animal, weight, height), how could I merge the subset back into the original set?
What you describe is a join operation in which you update some values in the original dataset. This is very easy to do with great performance using data.table because of its fast joins and update-by-reference concept (:=).
Here's an example for your toy data:
library(data.table)
setDT(df) # convert to data.table without copy
setDT(sub_df) # convert to data.table without copy
# join and update "df" by reference, i.e. without copy
df[sub_df, on = c("id", "animal"), weight := i.weight]
The data is now updated:
# id animal weight
#1: 1 dog 23.0
#2: 2 cat 2.2
#3: 3 duck 1.2
#4: 4 fairy 0.2
#5: 5 snake 1.3
You can use setDF to switch back to ordinary data.frame.
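The same update join also covers the second part of the question, where the original table has columns that the subset lacks. Below is a self-contained sketch; the height values are made up, and id is stored as integer in both tables so that the join keys have the same type.

```r
library(data.table)

df <- data.table(
  id = 1:5,
  animal = c("dog", "cat", "duck", "fairy", "snake"),
  weight = c("23", NA, "1.2", "0.2", "BAD"),
  height = c("54", "45", "21", "50", "42")  # hypothetical extra column
)
sub_df <- data.table(
  id = c(2L, 5L),
  animal = c("cat", "snake"),
  weight = c("2.2", "1.3")
)

# only weight is overwritten, and only in the rows matched by the join
df[sub_df, on = c("id", "animal"), weight := i.weight]

# switch back to a plain data.frame, again without copying
setDF(df)
df
```

The row order and the height column are left untouched.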
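Remove the rows that are about to be replaced first, then simply stack the tibbles. Filtering on NA alone is not enough here, because the "BAD" row for the snake would stay and end up duplicated:
bind_rows(filter(df, !id %in% sub_df$id), sub_df)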
Isn't dplyr::rows_update exactly what we need here? The following code should work:
df %>% dplyr::rows_update(sub_df, by = "id")
This should work as long as there is a unique identifier (one or multiple variables) for your datasets.
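A minimal sketch of rows_update() on the question's data, assuming dplyr 1.0.0 or later. The height column is made up to show that columns missing from the subset are left alone:

```r
library(dplyr)

df <- tibble(
  id = 1:5,
  animal = c("dog", "cat", "duck", "fairy", "snake"),
  weight = c("23", NA, "1.2", "0.2", "BAD"),
  height = c("54", "45", "21", "50", "42")  # hypothetical extra column
)
sub_df <- tibble(
  id = c(2L, 5L),
  animal = c("cat", "snake"),
  weight = c("2.2", "1.3")
)

# every column of sub_df must exist in df; rows are matched on id
updated <- rows_update(df, sub_df, by = "id")
updated
```

Unlike the anti_join plus bind_rows approach, this keeps the original row order.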
For anyone looking for a solution to use in a tidyverse pipeline:
I run into this problem a lot, and have written a short function that uses mostly tidyverse verbs to get around this. It will account for the case when there are additional columns in the original df.
For example, if the OP's df had an additional 'height' column:
library(dplyr)
df <- tibble(id = seq(1:5),
animal = c("dog", "cat", "duck", "fairy", "snake"),
weight = c("23", NA, "1.2", "0.2", "BAD"),
height = c("54", "45", "21", "50", "42"))
And the subset of data we wanted to join in was the same:
sub_df <- tibble(id = c(2, 5),
animal = c("cat", "snake"),
weight = c("2.2", "1.3"))
If we used the OP's method alone (anti_join %>% bind_rows), this won't work because of the additional 'height' column in df. An extra step or two is needed.
In this case we could use the following function:
replace_subset <- function(df, df_subset, id_col_names = c()) {
  # work out which of the columns contain "new" data
  new_data_col_names <- setdiff(colnames(df_subset), id_col_names)
  # complete the df_subset with the extra columns from df
  # (all_of() avoids the ambiguity of passing a character vector to select)
  df_sub_to_join <- df_subset %>%
    left_join(select(df, -all_of(new_data_col_names)), by = id_col_names)
  # join and bind rows
  df_out <- df %>%
    anti_join(df_sub_to_join, by = id_col_names) %>%
    bind_rows(df_sub_to_join)
  return(df_out)
}
Now for the results:
replace_subset(df = df , df_subset = sub_df, id_col_names = c("id"))
## A tibble: 5 x 4
# id animal weight height
# <dbl> <chr> <chr> <chr>
#1 1 dog 23 54
#2 3 duck 1.2 21
#3 4 fairy 0.2 50
#4 2 cat 2.2 45
#5 5 snake 1.3 42
And here's an example using the function in a pipeline:
df %>%
replace_subset(df_subset = sub_df, id_col_names = c("id")) %>%
mutate_at(.vars = vars(c('weight', 'height')), .funs = ~as.numeric(.)) %>%
mutate(bmi = weight / (height^2))
## A tibble: 5 x 5
# id animal weight height bmi
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 1 dog 23 54 0.00789
#2 3 duck 1.2 21 0.00272
#3 4 fairy 0.2 50 0.00008
#4 2 cat 2.2 45 0.00109
#5 5 snake 1.3 42 0.000737
hope this is helpful :)
I have a data frame describing a large number of people. I want to assign each person to a group, based on several variables. For example, let's say I have the variable "state" with 5 states, the variable "age group" with 4 groups and the variable "income" with 5 groups. I will have 5x4x5 = 100 groups, that I want to name with numbers going from 1 to 100. I have always done this in the past using a combination of ifelse statements, but now as I have 100 possible outcomes I am wondering if there is a faster way than specifying each combination by hand.
Here's a MWE with the expected outcome:
mydata <- as.data.frame(cbind(c("FR","UK","UK","IT","DE","ES","FR","DE","IT","UK"),
c("20","80","20","40","60","20","60","80","40","60"),c(1,4,2,3,1,5,5,3,4,2)))
colnames(mydata) <- c("Country","Age","Income")
group_grid <- transform(expand.grid(state = c("IT","FR","UK","ES","DE"),
age = c("20","40","60","80"), income = 1:5), val = 1:100)
desired_result <- as.data.frame(cbind(c("FR","UK","UK","IT","DE","ES","FR","DE","IT","UK"),
c("20","80","20","40","60","20","60","80","40","60"),
c(1,4,2,3,1,5,5,3,4,2),
c(2,78,23,46,15,84,92,60,66,33)))
colnames(desired_result) <- c("Country","Age","Income","Group_code")
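mydata$Group_code <- with(mydata, as.integer(interaction(Country, Age, Income))) gives every combination its own integer code. The numbering follows the factor level order (alphabetical by default), so the codes will not necessarily match those in group_grid.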
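To reproduce the exact codes from the question, the factor levels can be given in the same order as in group_grid. The sketch below relies on interaction() varying its first factor fastest, which is also how expand.grid() orders its rows. The data frame is rebuilt here with data.frame() instead of cbind() purely to keep the example short.

```r
mydata <- data.frame(
  Country = c("FR", "UK", "UK", "IT", "DE", "ES", "FR", "DE", "IT", "UK"),
  Age     = c("20", "80", "20", "40", "60", "20", "60", "80", "40", "60"),
  Income  = c(1, 4, 2, 3, 1, 5, 5, 3, 4, 2)
)

# levels listed in the same order as in group_grid
mydata$Group_code <- with(mydata, as.integer(interaction(
  factor(Country, levels = c("IT", "FR", "UK", "ES", "DE")),
  factor(Age, levels = c("20", "40", "60", "80")),
  factor(Income, levels = 1:5)
)))
mydata
```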
Here is a left_join option using dplyr:
library(dplyr)
# joining requires key columns of the same class, so convert them to character
grpD <- group_grid %>%
  mutate_if(is.factor, as.character) %>%
  mutate(income = as.character(income))
mydata %>%
  mutate_if(is.factor, as.character) %>% # change class here too
  left_join(., grpD, by = c("Country" = "state", "Age" = "age", "Income" = "income"))
# Country Age Income val
#1 FR 20 1 2
#2 UK 80 4 78
#3 UK 20 2 23
#4 IT 40 3 46
#5 DE 60 1 15
#6 ES 20 5 84
#7 FR 60 5 92
#8 DE 80 3 60
#9 IT 40 4 66
#10 UK 60 2 33
Here is data similar to what I am using:
df <- data.frame(Name=c("Joy","Jane","Jane","Joy"),Grade=c(40,20,63,110))
Name Grade
1 Joy 40
2 Jane 20
3 Jane 63
4 Joy 110
Agg <- ddply(df, .(Name), summarize,Grade= max(Grade))
Name Grade
1 Jane 63
2 Joy 110
As the grade cannot be greater than 100, I need 40 as the value for Joy and not 110. Basically, I want to exclude all values greater than 100 while summarizing. I could create a new data frame that excludes those values and then apply the ddply function, but I would like to know if I can do it on my original data frame. Thanks in advance.
Using ddply, we can use the logical condition to subset the values of 'Grade'
library(plyr)
ddply(df, .(Name), summarise, Grade = max(Grade[Grade <=100]))
# Name Grade
#1 Jane 63
#2 Joy 40
Or with dplyr, we filter the rows where "Grade" is less than or equal to 100, then group by "Name" and get the max of "Grade"
library(dplyr)
df %>%
filter(Grade <= 100) %>%
group_by(Name) %>%
summarise(Grade = max(Grade))
# Name Grade
# <fctr> <dbl>
#1 Jane 63
#2 Joy 40
Or instead of the filter, we can create the logical condition in summarise
df %>%
group_by(Name) %>%
summarise(Grade = max(Grade[Grade <=100]))
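One caveat with subsetting inside summarise: if a Name has no grade at or below 100, max() is called on an empty vector and returns -Inf with a warning. The sketch below guards against that with a small helper. Both safe_max and the extra "Sam" row are made up for illustration and are not part of the answer.

```r
library(dplyr)

df <- data.frame(Name = c("Joy", "Jane", "Jane", "Joy", "Sam"),
                 Grade = c(40, 20, 63, 110, 120))  # "Sam" has no valid grade

# hypothetical helper: max of the valid grades, or NA if there are none
safe_max <- function(x, limit = 100) {
  valid <- x[x <= limit]
  if (length(valid) > 0) max(valid) else NA_real_
}

res <- df %>%
  group_by(Name) %>%
  summarise(Grade = safe_max(Grade))
res
```

With the filter() variant above, such a Name would instead be dropped from the result entirely.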
Or with data.table, convert the 'data.frame' to 'data.table' (setDT(df)), create the logical condition (Grade <= 100) in 'i', grouped by "Name", get the max of "Grade".
library(data.table)
setDT(df)[Grade <= 100, .(Grade = max(Grade)), by = Name]
# Name Grade
#1: Joy 40
#2: Jane 63
Or using sqldf
library(sqldf)
sqldf("select Name,
max(Grade) as Grade
from df
where Grade <= 100
group by Name")
# Name Grade
#1 Jane 63
#2 Joy 40
In base R, another variant of aggregate would be
aggregate(Grade ~ Name, df, subset = Grade <= 100, max)
# Name Grade
#1 Jane 63
#2 Joy 40
You can also use base R aggregate for the same
aggregate(Grade ~ Name, df[df$Grade <= 100, ], max)
# Name Grade
#1 Jane 63
#2 Joy 40