I have 3 data sets. All of them have a column called ID. I would like to list, for each ID, the matching rows from all 3 tables (I'm not sure I'm explaining this right). For example:
df1
ID age
1 34
2 33
5 34
7 35
43 32
76 33
df2
ID height
1 178
2 176
5 166
7 159
43 180
76 178
df3
ID class type
1 a 1
2 b 1
5 a 2
7 b 3
43 b 2
76 a 3
I would like to have an output which looks like this
ID = 1
df1 age
34
df2 height
178
df3 class type
a 1
ID = 2
df1 age
33
df2 height
176
df3 class type
b 1
I wrote a script
listing <- function(x) {
for(i in 1:n) {
data <- print(x[x$ID == 'i', ])
print(data)
}
return(data)
}
why am I not getting the output I wanted?
This is a hack. If you want or need to export to a Word document, I strongly urge you to use something like R Markdown (for example in RStudio) with knitr (and, behind the scenes, pandoc). I'd encourage you to look at knitr::kable, for instance, as well as better looping structures for dealing with large numbers of datasets.
This hack can be improved considerably. But it gets you the output you want.
func <- function(...) {
dfnames <- as.character(match.call()[-1])
dfs <- setNames(list(...), dfnames)
IDs <- unique(unlist(lapply(dfs, `[[`, "ID")))
fmt <- paste("%", max(nchar(dfnames)), "s %s", sep = "")
for (id in IDs) {
cat(sprintf("ID = %d\n", id))
for (nm in dfnames) {
df <- dfs[[nm]][ dfs[[nm]]$ID == id, names(dfs[[nm]]) != "ID", drop =FALSE]
cat(paste(sprintf(fmt, c(nm, ""),
capture.output(print(df, row.names = FALSE))),
collapse = "\n"), "\n")
}
}
}
Execution. Though this is showing just two data.frames, you can provide an arbitrary number of data.frames (and in your preferred order) in the function arguments. It assumes you are providing them as direct variables and not subsetting within the function call ... you'll understand if you try it.
func(df1, df3)
# ID = 1
# df1 age
# 34
# df3 class type
# a 1
# ID = 2
# df1 age
# 33
# df3 class type
# b 1
# ID = 5
# df1 age
# 34
# df3 class type
# a 2
# ID = 7
# df1 age
# 35
# df3 class type
# b 3
# ID = 43
# df1 age
# 32
# df3 class type
# b 2
# ID = 76
# df1 age
# 33
# df3 class type
# a 3
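To see why the function wants bare variable names: the labels come from match.call(), which records the argument expressions as typed. A minimal sketch, where arg_names is a hypothetical helper (not part of the function above) and the two small data frames are made up for illustration:

```r
# Hypothetical helper: return the argument expressions as character strings,
# using the same match.call() trick as func() above.
arg_names <- function(...) as.character(match.call()[-1])

# Illustrative data only
df1 <- data.frame(ID = c(1, 2), age = c(34, 33))
df3 <- data.frame(ID = c(1, 2), class = c("a", "b"))

plain_names  <- arg_names(df1, df3)          # bare variables give clean labels
subset_names <- arg_names(df1[1:2, ], df3)   # an expression is deparsed as written

print(plain_names)
print(subset_names)
```

With a subsetting expression the first label is no longer just the variable name, so it cannot be used to label the output cleanly.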
(Personally, I can't imagine providing output in this format, but I don't know your tastes or use-case. There are many many other ways to show data like this. Like:
Reduce(function(x,y) merge(x, y, by = "ID"), list(df1, df2, df3))
# ID age height class type
# 1 1 34 178 a 1
# 2 2 33 176 b 1
# 3 5 34 166 a 2
# 4 7 35 159 b 3
# 5 43 32 180 b 2
# 6 76 33 178 a 3
It's much more concise. But, then again, I'm also assuming that you want to show them all at once instead of "show one, talk about it, then show another one, talk about it ...".)
Why not merge by ID?
df_1 <- merge(df1, df2, by = 'ID')
df_final <- merge(df_1, df3, by = 'ID')
or by using dplyr (chaining a second join to bring in df3):
library(dplyr)
full_join(df1, df2, by = "ID") %>% full_join(df3, by = "ID")
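One thing to watch when the tables do not share exactly the same IDs: merge() keeps only matching IDs by default, while all = TRUE behaves like a full join. A small base R sketch with made-up data (the two data frames below are illustrative, not the ones from the question):

```r
# Illustrative data: ID 5 is only in the first table, ID 7 only in the second
df1 <- data.frame(ID = c(1, 2, 5), age = c(34, 33, 34))
df2 <- data.frame(ID = c(1, 2, 7), height = c(178, 176, 159))

# Default: inner join, only IDs present in both tables survive
inner <- merge(df1, df2, by = "ID")

# all = TRUE: keep every ID, filling the gaps with NA
outer <- merge(df1, df2, by = "ID", all = TRUE)

print(inner)
print(outer)
```

In the question's data every ID appears in all three tables, so both variants give the same rows there.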
Related
I have some DF’s with different variable names, but they have the same content. Unfortunately, my files have no pattern, but I am now trying to standardize them. For example, I have these 4 DF’s and I would like to select only one variable:
KEY_WIN <- c(123,456,789)
COUNTRY <- c("USA","FRANCE","MEXICO")
DF1 <- data.frame(KEY_WIN,COUNTRY)
KEY_WINN <- c(12,55,889)
FOOD <- c("RICE","TOMATO","MANGO")
CAR <- c("BMW","FERRARI","TOYOTA")
DF2 <- data.frame(KEY_WINN,FOOD,CAR)
ID <- c(555,698,33)
CITY <- c("NYC","LONDON","PARIS")
DF3 <- data.frame(ID,CITY)
NUMBER <- c(3,436,1000)
OCEAN <- c("PACIFIC","ATLANTIC","INDIAN")
DF4 <- data.frame(NUMBER,OCEAN)
I would like to create a routine to select only the variables KEY_WIN, KEY_WINN, ID, NUMBER. My expected result would be:
DF_FINAL<- data.frame(KEY=c(123,456,789, 12,55,889, 555,698,33, 3,436,1000))
How would I select only those variables?
There are multiple ways I would imagine you could approach this.
First, you could put your data frames in a list:
listofDF <- list(DF1, DF2, DF3, DF4)
Then, you could bind_rows to add the data frames together, and then coalesce to merge into one column.
library(tidyverse)
bind_rows(listofDF) %>%
mutate(KEY = coalesce(KEY_WIN, KEY_WINN, ID, NUMBER)) %>%
select(KEY)
KEY
1 123
2 456
3 789
4 12
5 55
6 889
7 555
8 698
9 33
10 3
11 436
12 1000
If you knew that the first column was always your KEY column, you could simply do:
KEY = unlist(lapply(listofDF, "[[", 1))
This would extract the first column from all of your data frames:
[1] 123 456 789 12 55 889 555 698 33 3 436 1000
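If the key is not always the first column, another option is to look it up by name from a list of known key names. A sketch in base R, assuming each data frame contains exactly one of the candidate names (key_names and get_key are names I made up for illustration):

```r
DF1 <- data.frame(KEY_WIN = c(123, 456, 789), COUNTRY = c("USA", "FRANCE", "MEXICO"))
DF2 <- data.frame(KEY_WINN = c(12, 55, 889), FOOD = c("RICE", "TOMATO", "MANGO"))
DF3 <- data.frame(ID = c(555, 698, 33), CITY = c("NYC", "LONDON", "PARIS"))
DF4 <- data.frame(NUMBER = c(3, 436, 1000), OCEAN = c("PACIFIC", "ATLANTIC", "INDIAN"))
listofDF <- list(DF1, DF2, DF3, DF4)

# Assumed list of column names that should be treated as the key
key_names <- c("KEY_WIN", "KEY_WINN", "ID", "NUMBER")

# Hypothetical helper: pull out whichever candidate column the data frame has
get_key <- function(df) df[[intersect(names(df), key_names)[1]]]

DF_FINAL <- data.frame(KEY = unlist(lapply(listofDF, get_key)))
print(DF_FINAL)
```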
I have two data frames
df1 <- data.frame(Region = c(1:5), Code = c(10,11,12,15,15), date = c("2018-12","2018-11","2019-01","2019-01","2019-02"))
df2 <- data.frame(Code = c(10,11,12,13,14,15,16,17,18,19),"2018-10" = c(50:59),"2018-11" = c(20:29),"2018-12" = c(25:34),"2019-01" = c(32:41),"2019-01" = c(40:49),"2019-02" = c(40:49))
I would like to match and store the corresponding values of df1$Region in df3.
The result should look as follows
df3 <- data.frame(Region = c(1:5),Results=c(25,21,34,45,45))
We can use row/column indexing to extract the values: matching the 'Code' columns gives the row index, and matching 'date' in df1 against the column names of df2 gives the column index (without using any external packages)
cbind(df1['Region'], Results = df2[-1][cbind(match(df1$Code, df2$Code),
match(df1$date,
sub('^X(\\d{4})\\.', "\\1-", names(df2)[-1])))])
# Region Results
#1 1 25
#2 2 21
#3 3 34
#4 4 37
#5 5 45
NOTE: The column names in the OP's post have an X prefix and use . instead of - because the data frame was created with check.names = TRUE (the default).
If the datasets were created with check.names = FALSE, the above solution can be simplified further:
cbind(df1['Region'], Results = df2[-1][cbind(match(df1$Code, df2$Code),
match(df1$date, names(df2)[-1]))])
# Region Results
#1 1 25
#2 2 21
#3 3 34
#4 4 37
#5 5 45
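The core of the approach is indexing with a two-column matrix, where each row of the matrix is one (row, column) pair to pick out. A stripped-down sketch with made-up data (lookup and query are hypothetical names, and there are no duplicated column names here):

```r
# Illustrative lookup table; check.names = FALSE keeps the "2018-11" style names
lookup <- data.frame(Code = c(10, 11, 12),
                     "2018-11" = c(20, 21, 22),
                     "2018-12" = c(25, 26, 27),
                     check.names = FALSE)
query <- data.frame(Code = c(12, 10), date = c("2018-11", "2018-12"))

values <- as.matrix(lookup[-1])                    # drop the Code column
idx <- cbind(match(query$Code, lookup$Code),       # row positions
             match(query$date, colnames(values)))  # column positions

picked <- values[idx]  # one value per (row, column) pair
print(picked)
```

Here Code 12 with date 2018-11 picks 22, and Code 10 with date 2018-12 picks 25.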
Update
If the column names are duplicated and you want to match based on that information, then
i1 <- duplicated(df1$date)
v1 <- numeric(nrow(df1))
v1[!i1] <- df2[-1][cbind(match(df1$Code[!i1],
df2$Code),match(df1$date[!i1], names(df2)[-1]))]
v1[i1] <- rev(df2[-1])[cbind(match(df1$Code[i1],
df2$Code),match(df1$date[i1], rev(names(df2)[-1])))]
cbind(df1['Region'], Results = v1)
# Region Results
#1 1 25
#2 2 21
#3 3 34
#4 4 45
#5 5 45
NOTE: No external packages used
One option involving dplyr and tidyr could be:
library(dplyr)
library(tidyr)
df1 %>%
inner_join(df2 %>%
pivot_longer(-Code), by = c("Code" = "Code",
"date" = "name"))
Region Code date value
1 1 10 2018-12 25
2 2 11 2018-11 21
3 3 12 2019-01 34
4 4 15 2019-01 37
5 5 15 2019-02 45
I considered two columns in df2 with the same name as a typo.
I am looking for a generic data frame update function, like SQL's UPDATE, that updates values in the first data frame where its keys match the keys in the second data frame. Is there a more generic way than in my example, maybe one that also takes the value column names into account? Something like a generic dplyr::update(df1, df2, by = "key") function?
library(tidyverse)
# example data frame
df1 <- as_data_frame(list(key = c(1,2,3,4,5,6,7,8,9),
v1 = c(11,12,13,14,15,16,17,18,19),
v2 = c(21,22,23,24,25,26,27,28,29),
v3 = c(31,32,33,34,35,36,37,38,39),
v4 = c(41,42,43,44,45,46,47,48,49)))
df2 <- as_data_frame(list(key = c(3,5,9),
v2 = c(231,252,293),
v4 = c(424,455,496)))
# update df1 with values from df2 where key match
org_names <- df1 %>% names()
df1 <- df1 %>%
left_join(df2, by = "key") %>%
mutate(v2 = ifelse(is.na(v2.y), v2.x, v2.y),
v4 = ifelse(is.na(v4.y), v4.x, v4.y)) %>%
select(org_names)
> df1
# A tibble: 9 x 5
key v1 v2 v3 v4
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11 21 31 41
2 2 12 22 32 42
3 3 13 231 33 424
4 4 14 24 34 44
5 5 15 252 35 455
6 6 16 26 36 46
7 7 17 27 37 47
8 8 18 28 38 48
9 9 19 293 39 496
>
1) %<>% Magrittr has the compound assignment pipe:
library(magrittr)
df1 %>%
{ keys <- intersect(.$key, df2$key)
.[match(keys, .$key), names(df2)] %<>% { df2[match(keys, df2$key), ] }
.
}
which, for the problem under consideration, simplifies to this because all keys in df2 are in df1:
df1 %>% { .[match(df2$key, .$key), names(df2)] %<>% { df2 }; . }
2) <- The basic R assignment operator could also be used in much the same way and, in fact, the code is shorter than (1):
df1 %>%
{ keys <- intersect(.$key, df2$key)
.[match(keys, .$key), names(df2)] <- df2[match(keys, df2$key), ]
.
}
however, for the problem under consideration all keys in df2 are in df1 so it simplifies to:
df1 %>% { .[match(df2$key, .$key), names(df2)] <- df2; . }
3) mutate_cond Using mutate_cond defined in this SO post we can write the following.
df1 %>% mutate_cond(.$key %in% df2$key, v2 = df2$v2, v4 = df2$v4)
Note: The first two approaches work if the keys in df1 and df2 are each unique. The third additionally requires the keys be in the same order and every key in df2 be in df1. The problem in the question satisfies these.
Update: Have somewhat generalized the code in (1) and (2).
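For what it's worth, the match()-based idea in (2) can be wrapped into a small reusable function. This is only a sketch in base R: update_by_key is a hypothetical name, and it assumes the keys are unique in both data frames and that only columns present in both should be overwritten. (dplyr 1.0.0 and later also provide rows_update(), which is aimed at this kind of task.)

```r
# Hypothetical helper: overwrite values in x with those from y where the key matches.
# Assumes keys are unique in both x and y.
update_by_key <- function(x, y, by) {
  rows <- match(y[[by]], x[[by]])   # position in x of each key in y (NA if absent)
  keep <- !is.na(rows)
  cols <- setdiff(intersect(names(x), names(y)), by)  # shared value columns
  if (!any(keep) || length(cols) == 0) return(x)      # nothing to update
  x[rows[keep], cols] <- y[keep, cols, drop = FALSE]
  x
}

df1 <- data.frame(key = 1:9, v1 = 11:19, v2 = 21:29, v3 = 31:39, v4 = 41:49)
df2 <- data.frame(key = c(3, 5, 9), v2 = c(231, 252, 293), v4 = c(424, 455, 496))

res <- update_by_key(df1, df2, by = "key")
print(res)
```

Keys in df2 that do not occur in df1 are silently ignored in this sketch; whether that is the right behaviour depends on your use case.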
I've got a data.frame dt with some duplicate keys and missing data, i.e.
Name Height Weight Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA
In this case the key is the name, and I would like to apply to each column a function like
f <- function(x){
x <- x[!is.na(x)]
x <- x[1]
return(x)
}
while aggregating by the key (i.e., the "Name" column), so as to obtain as a result
Name Height Weight Age
Alice 180 70 35
Bob NA 80 27
Charles 170 75 NA
I tried
dt_agg <- aggregate(. ~ Name,
data = dt,
FUN = f)
and I got some errors, then I tried the following
dt_agg_1 <- aggregate(Height ~ Name,
data = dt,
FUN = f)
dt_agg_2 <- aggregate(Weight ~ Name,
data = dt,
FUN = f)
and this time it worked.
Since I have 50 columns, this second approach is quite cumbersome for me. Is there a way to fix the first approach?
Thanks for help!
You were very close with the aggregate function; you needed to change how aggregate handles NA (from na.omit to na.pass). My guess is that aggregate removes all rows containing an NA first and then aggregates, instead of removing NAs as it iterates over the columns to be aggregated. Since your example data frame has an NA in every row, you end up with a 0-row data frame (which is the error I was getting when running your code). I tested this by removing all but one NA, and your code works as-is. So we set na.action = na.pass to pass the NAs through.
dt_agg <- aggregate(. ~ Name,
data = dt,
FUN = f, na.action = "na.pass")
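A quick way to check the explanation above is to look at what na.omit leaves behind. The sketch below rebuilds the question's table by hand and uses a one-line version of f:

```r
dt <- data.frame(Name   = c("Alice", "Bob", "Alice", "Charles"),
                 Height = c(180, NA, NA, 170),
                 Weight = c(NA, 80, 70, 75),
                 Age    = c(35, 27, NA, NA))

# First non-missing value (NA if there is none)
f <- function(x) x[!is.na(x)][1]

# Every row has at least one NA, so dropping incomplete rows leaves nothing
rows_left <- nrow(na.omit(dt))
print(rows_left)

# With na.pass the rows are kept and f deals with the NAs column by column
dt_agg <- aggregate(. ~ Name, data = dt, FUN = f, na.action = na.pass)
print(dt_agg)
```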
original answer
dt_agg <- aggregate(dt[, -1],
by = list(dt$Name),
FUN = f)
dt_agg
# Group.1 Height Weight Age
# 1 Alice 180 70 35
# 2 Bob NA 80 27
# 3 Charles 170 75 NA
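You can do this with dplyr. Note that sort() drops NAs and orders the values, so this picks the smallest non-missing value per group rather than the first one; for this data the two coincide: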
library(dplyr)
df %>%
group_by(Name) %>%
summarize_all(funs(sort(.)[1]))
Result:
# A tibble: 3 x 4
Name Height Weight Age
<fctr> <int> <int> <int>
1 Alice 180 70 35
2 Bob NA 80 27
3 Charles 170 75 NA
Data:
df = read.table(text = "Name Height Weight Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA", header = TRUE)
Here is an option with data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) head(sort(x), 1)), Name]
# Name Height Weight Age
#1: Alice 180 70 35
#2: Bob NA 80 27
#3: Charles 170 75 NA
Simply add na.action = na.pass to the aggregate() call:
aggdf <- aggregate(.~Name, data=df, FUN=f, na.action=na.pass)
# Name Height Weight Age
# 1 Alice 180 70 35
# 2 Bob NA 80 27
# 3 Charles 170 75 NA
First add an ifelse() to your function so that it still returns a value (NA) when all values are NA:
f <- function(x) {
x <- x[!is.na(x)]
ifelse(length(x) == 0, NA, x)
}
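A quick check of what this version of f returns. Because the test passed to ifelse() has length one, the result also has length one, so only the first remaining element comes back (the input vectors are made up for illustration):

```r
f <- function(x) {
  x <- x[!is.na(x)]
  ifelse(length(x) == 0, NA, x)
}

first_value <- f(c(NA, 70, 80))  # first non-missing element
all_missing <- f(c(NA, NA))      # nothing left after dropping NAs

print(first_value)
print(all_missing)
```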
You can use dplyr to aggregate:
library(dplyr)
dt %>% group_by(Name) %>% summarise_all(funs(f))
This returns:
# A tibble: 3 x 4
Name Height Weight Age
<fctr> <dbl> <dbl> <dbl>
1 Alice 180 70 35
2 Bob NA 80 27
3 Charles 170 75 NA
I have a data frame describing a large number of people. I want to assign each person to a group, based on several variables. For example, let's say I have the variable "state" with 5 states, the variable "age group" with 4 groups and the variable "income" with 5 groups. I will have 5x4x5 = 100 groups, that I want to name with numbers going from 1 to 100. I have always done this in the past using a combination of ifelse statements, but now as I have 100 possible outcomes I am wondering if there is a faster way than specifying each combination by hand.
Here's a MWE with the expected outcome:
mydata <- as.data.frame(cbind(c("FR","UK","UK","IT","DE","ES","FR","DE","IT","UK"),
c("20","80","20","40","60","20","60","80","40","60"),c(1,4,2,3,1,5,5,3,4,2)))
colnames(mydata) <- c("Country","Age","Income")
group_grid <- transform(expand.grid(state = c("IT","FR","UK","ES","DE"),
age = c("20","40","60","80"), income = 1:5), val = 1:100)
desired_result <- as.data.frame(cbind(c("FR","UK","UK","IT","DE","ES","FR","DE","IT","UK"),
c("20","80","20","40","60","20","60","80","40","60"),
c(1,4,2,3,1,5,5,3,4,2),
c(2,78,23,46,15,84,92,60,66,33)))
colnames(desired_result) <- c("Country","Age","Income","Group_code")
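mydata$Group_code <- with(mydata, as.integer(interaction(Country, Age, Income))) should do it, as long as the codes only need to be unique per combination. The numbering follows the factor level order of each variable (alphabetical by default), so it will not necessarily match the val column of group_grid.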
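If the codes have to match the numbering in group_grid exactly, one option is to set the factor levels explicitly, in the same order as in expand.grid. A sketch, with the level vectors copied from the question's group_grid and mydata rebuilt with data.frame() rather than cbind():

```r
mydata <- data.frame(
  Country = c("FR", "UK", "UK", "IT", "DE", "ES", "FR", "DE", "IT", "UK"),
  Age     = c("20", "80", "20", "40", "60", "20", "60", "80", "40", "60"),
  Income  = c(1, 4, 2, 3, 1, 5, 5, 3, 4, 2)
)

# Levels in the same order as in group_grid; in interaction() the first
# factor varies fastest, just like the first argument of expand.grid()
mydata$Group_code <- with(mydata, as.integer(interaction(
  factor(Country, levels = c("IT", "FR", "UK", "ES", "DE")),
  factor(Age,     levels = c("20", "40", "60", "80")),
  factor(Income,  levels = 1:5)
)))

print(mydata)
```

For this data that gives 2, 78, 23, 46, 15, 84, 92, 60, 66, 33, the same as the Group_code column in desired_result.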
Here is a left_join option using dplyr:
library(dplyr)
# the join columns must have the same class, so convert factors and income to character
grpD <- group_grid %>%
mutate_if(is.factor, as.character) %>%
mutate(income = as.character(income))
mydata %>%
mutate_if(is.factor, as.character) %>% # convert factors here too
left_join(grpD, by = c("Country" = "state", "Age" = "age", "Income" = "income"))
# Country Age Income val
#1 FR 20 1 2
#2 UK 80 4 78
#3 UK 20 2 23
#4 IT 40 3 46
#5 DE 60 1 15
#6 ES 20 5 84
#7 FR 60 5 92
#8 DE 80 3 60
#9 IT 40 4 66
#10 UK 60 2 33