If I have two data frames in R (let's call them df1 and df2 respectively) such as
> df1
state num1
AL 22
AK 49
AZ 48
AR 25
and
> df2
state num2
AK 2
AZ 3
AR 4
CA 5
how do I aggregate those data frames while subtracting the values to form something like
state num3
AL 22
AK 47
AZ 45
AR 21
CA -5
Note: the key values are not the same in the data frames; the data frames have different numbers of rows
There may be an easier way to get there, but here's a possibility. We can merge() the two data frames, then subtract the columns after replacing the NA values with zero.
m <- merge(df1, df2, all = TRUE)
cbind(m[1], num3 = with(replace(m, is.na(m), 0L), num1 - num2))
# state num3
# 1 AK 47
# 2 AL 22
# 3 AR 21
# 4 AZ 45
# 5 CA -5
Data:
df1 <- structure(list(state = structure(c(2L, 1L, 4L, 3L), .Label = c("AK",
"AL", "AR", "AZ"), class = "factor"), num1 = c(22L, 49L, 48L,
25L)), .Names = c("state", "num1"), row.names = c(NA, 4L), class = "data.frame")
df2 <- structure(list(state = structure(c(1L, 3L, 2L, 4L), .Label = c("AK",
"AR", "AZ", "CA"), class = "factor"), num2 = 2:5), .Names = c("state",
"num2"), row.names = 2:5, class = "data.frame")
One way with dplyr would be the following. You combine the two data frames with full_join(). Then you replace the NA values with 0, handle the subtraction in the mutate() step, and finally choose the necessary columns with select().
DATA
mydf1 <- structure(list(state = structure(c(2L, 1L, 4L, 3L), .Label = c("AK",
"AL", "AR", "AZ"), class = "factor"), num1 = c(22L, 49L, 48L,
25L)), .Names = c("state", "num1"), class = "data.frame", row.names = c(NA,
-4L))
mydf2 <- structure(list(state = structure(c(1L, 3L, 2L, 4L), .Label = c("AK",
"AR", "AZ", "CA"), class = "factor"), num2 = 2:5), .Names = c("state",
"num2"), class = "data.frame", row.names = c(NA, -4L))
CODE
full_join(mydf1, mydf2, by = "state") %>%
mutate(across(num1:num2, ~ replace(.x, is.na(.x), 0L))) %>%
mutate(num3 = num1 - num2) %>%
select(state, num3)
# state num3
#1 AL 22
#2 AK 47
#3 AZ 45
#4 AR 21
#5 CA -5
Instead of merging the data frames, we can combine their rows: first we change the sign of the column num2, and then we aggregate the results by state.
Base package:
aggregate(num1 ~ state,
data = rbind(df1, setNames(data.frame(df2[1], -df2[2]), names(df1))),
FUN = sum)
Output:
state num1
1 AK 47
2 AL 22
3 AR 21
4 AZ 45
5 CA -5
dplyr:
library(dplyr)
rbind(df1, setNames(data.frame(df2[1], -df2[2]), names(df1))) %>%
group_by(state) %>%
summarise(sum = sum(num1))
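For completeness, the same merge-and-subtract idea can also be sketched with data.table (assuming the package is installed); the data below mirrors df1 and df2 from the question:

```r
library(data.table)

df1 <- data.table(state = c("AL", "AK", "AZ", "AR"), num1 = c(22L, 49L, 48L, 25L))
df2 <- data.table(state = c("AK", "AZ", "AR", "CA"), num2 = 2:5)

# full outer join on state, then treat missing values as zero before subtracting
m <- merge(df1, df2, by = "state", all = TRUE)
m[is.na(num1), num1 := 0L]
m[is.na(num2), num2 := 0L]
res <- m[, .(state, num3 = num1 - num2)]
res  # AK 47, AL 22, AR 21, AZ 45, CA -5
```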
Related
There are two data frames which I want to connect. I have two dimensions (letter and year) in the first table, and for each row I want to pull the value from the matching year column of the second table.
The first dataframe looks like this:
letter year value
A 2001
B 2002
C 2003
D 2004
second one:
letter 2001 2002 2003 2004
A 4 9 9 9
B 6 7 6 6
C 2 3 5 8
D 1 1 1 1
so that I end up with something like this
letter year value
A 2001 4
B 2002 7
C 2003 5
D 2004 1
Thank you all!
One option is to use row/column indexing. Here, the row index is simply the sequence of rows, while the column index comes from matching the 'year' column of the first data with the column names of the second. We cbind the indexes to create a matrix ('m1') and use that to extract values from the second dataset, assigning those to the 'value' column in the first data.
i1 <- seq_len(nrow(df1))
j1 <- match(df1$year, names(df2)[-1])
m1 <- cbind(i1, j1)
df1$value <- df2[-1][m1]
df1
# letter year value
#1 A 2001 4
#2 B 2002 7
#3 C 2003 5
#4 D 2004 1
For this specific example, the pattern to extract happens to be the diagonal elements; in that case, we can also use
df1$value <- diag(as.matrix(df2[-1]))
data
df1 <- structure(list(letter = c("A", "B", "C", "D"), year = 2001:2004),
class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(letter = c("A", "B", "C", "D"), `2001` = c(4L,
6L, 2L, 1L), `2002` = c(9L, 7L, 3L, 1L), `2003` = c(9L, 6L, 5L,
1L), `2004` = c(9L, 6L, 8L, 1L)), class = "data.frame",
row.names = c(NA,
-4L))
Another option in the tidyverse would be to first pivot your value data to a longer data frame (data from #akrun's answer):
df2.long <- df2 %>%
pivot_longer(`2001`:`2004`, names_to = 'year', values_to = 'value')
# A tibble: 16 x 3
letter year value
<chr> <chr> <int>
1 A 2001 4
2 A 2002 9
3 A 2003 9
4 A 2004 9
5 B 2001 6
6 B 2002 7
7 B 2003 6
8 B 2004 6
9 C 2001 2
10 C 2002 3
...
And then perform an inner_join to the data frame containing your desired letter/year combinations:
df.final <- df2.long %>%
mutate(year = as.numeric(year)) %>%
inner_join(df1)
letter year value
<chr> <dbl> <int>
1 A 2001 4
2 B 2002 7
3 C 2003 5
4 D 2004 1
Base R solution:
# Reshape your dataframe from wide to long:
df3 <- reshape(df2,
direction = "long",
idvar = "letter",
varying = c(names(df2)[names(df2) != "letter"]),
v.names = "Value",
timevar = "Year",
times = names(df2)[names(df2) != "letter"],
new.row.names = 1:(nrow(df2) * length(names(df2)[names(df2) != "letter"]))
)
# Inner join the long_df with the first dataframe:
df_final <- merge(df1, transform(df3, Year = as.numeric(Year)),
                  by.x = c("letter", "year"), by.y = c("letter", "Year"))
Tidyverse solution (slightly expanding on #jdobres' solution above):
lapply(c("dplyr", "tidyr"), require, character.only = TRUE)
df3_long <-
df2 %>%
pivot_longer(`2001`:`2004`, names_to = 'year', values_to = 'value') %>%
mutate(year = as.numeric(year)) %>%
inner_join(., df1, by = c("letter", "year"))
Data:
df1 <-
structure(list(letter = c("A", "B", "C", "D"), year = 2001:2004),
class = "data.frame",
row.names = c(NA,-4L))
df2 <-
structure(
list(
letter = c("A", "B", "C", "D"),
`2001` = c(4L,
6L, 2L, 1L),
`2002` = c(9L, 7L, 3L, 1L),
`2003` = c(9L, 6L, 5L,
1L),
`2004` = c(9L, 6L, 8L, 1L)
),
class = "data.frame",
row.names = c(NA,-4L)
)
I have the following data:
store location mass target
1 1 (Ams) 45 ?
2 5 (Ber) 500 ?
3 8 (Mar) 1003 ?
In this last column target I would like to have a value from the table:
location
mass range 1 5 8
0 - 350 3 4 5
> 351 6 7 8
So the target column should contain the values, 3, 7, 8 in the first three rows.
I tried to use the function INDEX(), but it did not work out. If anyone knows how to do this in R or in Power BI, that would also help me. Thanks!
In R the example is reproducible by using:
structure(list(Store = 1:3, Location = structure(c(2L, 3L, 1L
), .Label = c("08-Mar", "1 Ams", "5 Ber"), class = "factor"),
Mass = c(1000L, 800L, 500L)), class = "data.frame", row.names = c(NA,
-3L))
and
structure(list(X = structure(1:2, .Label = c("0 - 350", "351 - 1000"
), class = "factor"), X1 = c(3L, 6L), X5 = c(4L, 7L), X8 = c(5L,
8L)), class = "data.frame", row.names = c(NA, -2L))
Reshape your table 2, then you can use the INDEX and MATCH functions: MATCH finds the mass-range row and the location column, and INDEX pulls the value at that position.
In R, we need a bit of pre-processing before we can actually merge the two tables, since the data is not in a standard format. Assuming the two tables are called df1 and df2 respectively, we separate the data into different columns for Location in df1 and X in df2. We also prepend an "X" to the location number in df1 so that it matches the column names of df2. We bring df2 into long format using gather and use fuzzy_left_join to merge by number range.
library(fuzzyjoin)
library(tidyverse)
df1 %>%
separate(Location, into = c("Loc1", "Loc2"), sep = "\\s+|-", convert = TRUE) %>%
mutate(Loc1 = paste0("X", Loc1)) %>%
fuzzy_left_join(df2 %>%
separate(X, into = c("start", "end"), convert = TRUE) %>%
gather(key, Target, starts_with("X")),
by = c("Loc1" = "key", "Mass" = "start", "Mass" = "end"),
match_fun = list(`==`, `>=`, `<=`))
# Store Loc1 Loc2 Mass start end key Target
#1 1 X1 Ams 1000 351 1000 X1 6
#2 2 X5 Ber 800 351 1000 X5 7
#3 3 X8 Mar 45 0 350 X8 5
data
df1 <- structure(list(Store = 1:3, Location = structure(c(2L, 3L, 1L
), .Label = c("08-Mar", "1 Ams", "5 Ber"), class = "factor"),
Mass = c(1000, 800, 45)), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(X = structure(1:2, .Label = c("0 - 350", "351 - 1000"
), class = "factor"), X1 = c(3L, 6L), X5 = c(4L, 7L), X8 = c(5L,
8L)), class = "data.frame", row.names = c(NA, -2L))
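A base-R alternative to fuzzyjoin, sketched under the same df1/df2 layout: derive the column from the leading number in Location and the row from the mass band that Mass falls into, then use matrix indexing (the helper names loc and row_idx are my own):

```r
df1 <- data.frame(Store = 1:3,
                  Location = c("1 Ams", "5 Ber", "08-Mar"),
                  Mass = c(1000, 800, 45))
df2 <- data.frame(X  = c("0 - 350", "351 - 1000"),
                  X1 = c(3L, 6L), X5 = c(4L, 7L), X8 = c(5L, 8L))

# column index: "X" + the leading digits of Location; row index: the mass band
loc     <- paste0("X", as.integer(sub("\\D.*", "", df1$Location)))
row_idx <- findInterval(df1$Mass, c(0, 351))   # 1 = 0-350, 2 = 351 and up
df1$Target <- df2[-1][cbind(row_idx, match(loc, names(df2)[-1]))]
df1$Target  # 6 7 5
```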
I am in the process of trying to make untidy data tidy. I have data in the following format:
name x
a NA
value 1
b NA
value 2
c NA
value 3
I would like it to be in the following format
name x
a_value 1
b_value 2
c_value 3
How can I do this in dplyr?
My first thought is to come up with a way to spread so that
name name2 x x2
a value NA 1
b value NA 2
c value NA 3
From there I know I can use unite for name and name2 and delete column x, but I am not sure if spread can produce the above.
You can group on the cumulative count of NA rows and summarise, keeping the non-NA x value per group, i.e.
library(dplyr)
df %>%
 group_by(grp = cumsum(is.na(x))) %>%
 summarise(name = paste(name, collapse = '_'), x = sum(x, na.rm = TRUE))
which gives,
# A tibble: 3 x 3
    grp name        x
  <int> <chr>   <int>
1     1 a_value     1
2     2 b_value     2
3     3 c_value     3
DATA
dput(df)
structure(list(name = c("a", "value", "b", "value", "c", "value"
), x = c(NA, 1L, NA, 2L, NA, 3L)), .Names = c("name", "x"), row.names = c(NA,
-6L), class = "data.frame")
Use na.locf and then remove the unwanted rows:
library(dplyr)
library(zoo)
DF %>%
mutate(x = na.locf(x, fromLast = TRUE)) %>%
filter(name != "value")
giving:
name x
1 a 1
2 b 2
3 c 3
Note
DF <-
structure(list(name = structure(c(1L, 4L, 2L, 4L, 3L, 4L), .Label = c("a",
"b", "c", "value"), class = "factor"), x = c(NA, 1L, NA, 2L,
NA, 3L)), .Names = c("name", "x"), class = "data.frame", row.names = c(NA,
-6L))
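The same back-fill idea can also be written with tidyr::fill() instead of zoo::na.locf(); a sketch assuming dplyr and tidyr are available:

```r
library(dplyr)
library(tidyr)

DF <- data.frame(name = c("a", "value", "b", "value", "c", "value"),
                 x = c(NA, 1L, NA, 2L, NA, 3L))

res <- DF %>%
  fill(x, .direction = "up") %>%   # copy each value upward into the NA row above it
  filter(name != "value")
res  # a 1, b 2, c 3
```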
I have multiple dataframes with identical column names and dimensions:
df1
device_id price tax
1 a 200 5
2 b 100 2
3 c 50 1
df2
device_id price tax
1 b 200 7
2 a 100 3
3 c 50 1
df3
device_id price tax
1 c 50 5
2 b 300 1
3 a 50 2
What I want to do is to create another dataframe df where I will add the price and tax values from the above three dataframes with matching device_ids.
So,
df would be like
df
device_id price tax
1 a 350 10
2 b 600 10
3 c 150 7
How can I do it? Also, it would be great if the solution could be generalized to a larger number of dataframes instead of just three.
First, get all your data frames into a list (called dflist here, defined below). Then it's easy to do with aggregate() after row-binding the list elements.
aggregate(. ~ device_id, do.call(rbind, dflist), sum)
# device_id price tax
# 1 a 350 10
# 2 b 600 10
# 3 c 150 7
Or you could use the data.table package.
library(data.table)
rbindlist(dflist)[, lapply(.SD, sum), by = device_id]
# device_id price tax
# 1: a 350 10
# 2: b 600 10
# 3: c 150 7
Or dplyr.
library(dplyr)
bind_rows(dflist) %>%
group_by(device_id) %>%
summarize(across(everything(), sum))
# # A tibble: 3 x 3
#   device_id price   tax
#   <fct>     <int> <int>
# 1 a           350    10
# 2 b           600    10
# 3 c           150     7
Data:
dflist <- structure(list(df1 = structure(list(device_id = structure(1:3, .Label = c("a",
"b", "c"), class = "factor"), price = c(200L, 100L, 50L), tax = c(5L,
2L, 1L)), .Names = c("device_id", "price", "tax"), class = "data.frame", row.names = c("1",
"2", "3")), df2 = structure(list(device_id = structure(c(2L,
1L, 3L), .Label = c("a", "b", "c"), class = "factor"), price = c(200L,
100L, 50L), tax = c(7L, 3L, 1L)), .Names = c("device_id", "price",
"tax"), class = "data.frame", row.names = c("1", "2", "3")),
df3 = structure(list(device_id = structure(c(3L, 2L, 1L), .Label = c("a",
"b", "c"), class = "factor"), price = c(50L, 300L, 50L),
tax = c(5L, 1L, 2L)), .Names = c("device_id", "price",
"tax"), class = "data.frame", row.names = c("1", "2", "3"
))), .Names = c("df1", "df2", "df3"))
We can use by from base R after rbinding, once we place all the data.frame objects in a list (mget(paste0("df", 1:3)))
dfN <- do.call(rbind, mget(paste0("df", 1:3)))
do.call(rbind, by(dfN[-1], dfN[1], FUN = colSums))
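Note that by() returns a list, so the do.call(rbind, ...) result is a matrix with device_id as row names rather than a data frame; a small sketch of converting it back, using the same df1/df2/df3 as above:

```r
df1 <- data.frame(device_id = c("a", "b", "c"), price = c(200L, 100L, 50L), tax = c(5L, 2L, 1L))
df2 <- data.frame(device_id = c("b", "a", "c"), price = c(200L, 100L, 50L), tax = c(7L, 3L, 1L))
df3 <- data.frame(device_id = c("c", "b", "a"), price = c(50L, 300L, 50L), tax = c(5L, 1L, 2L))

dfN <- do.call(rbind, mget(paste0("df", 1:3)))
m   <- do.call(rbind, by(dfN[-1], dfN[1], FUN = colSums))

# m is a numeric matrix; move the row names back into a device_id column
out <- data.frame(device_id = rownames(m), m, row.names = NULL)
out  # a 350 10, b 600 10, c 150 7
```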
I have a list (here only sample data)
my_list <- list(structure(list(sample = c(2L, 6L), data1 = c(56L, 78L),
data2 = c(59L, 27L), data3 = c(90L, 28L), data1namet = structure(c(1L,
1L), .Label = "Sam1", class = "factor"), data2namab = structure(c(1L,
1L), .Label = "Test2", class = "factor"), dataame = structure(c(1L,
1L), .Label = "Ex3", class = "factor"), ma = c("Jay", "Jay"
)), .Names = c("sample", "data1", "data2", "data3", "data1namet",
"data2namab", "dataame", "ma"), row.names = c(NA, -2L), class = "data.frame"),
structure(list(sample = c(12L, 13L, 17L), data1 = c(56L,
78L, 3L), data2 = c(59L, 27L, 2L), datest = structure(c(1L,
1L, 1L), .Label = "Exa9", class = "factor"), dattestr = structure(c(1L,
1L, 1L), .Label = "cz1", class = "factor"), add = c(2, 2,
2)), .Names = c("sample", "data1", "data2", "datest", "dattestr",
"add"), row.names = c(NA, -3L), class = "data.frame"))
my_list
[[1]]
sample data1 data2 data3 data1namet data2namab dataame ma
1 2 56 59 90 Sam1 Test2 Ex3 Jay
2 6 78 27 28 Sam1 Test2 Ex3 Jay
[[2]]
sample data1 data2 datest dattestr add
1 12 56 59 Exa9 cz1 2
2 13 78 27 Exa9 cz1 2
3 17 3 2 Exa9 cz1 2
I've got two problems:
I would like to extract columns in this list based on patterns of their column names, e.g. all columns which contain the word 'data' in their column name. I wasn't able to find a solution with grep.
I know how to extract one column based on their index number (see example below), but how could I do this selection directly based on the column name (not the column number)?
out <- lapply(my_list, `[`, 1) # extract "sample" column
Try
lapply(my_list, function(df) df[, grep("data", names(df), fixed = TRUE)] )
# [[1]]
# data1 data2 data3 data1namet data2namab dataame
# 1 56 59 90 Sam1 Test2 Ex3
# 2 78 27 28 Sam1 Test2 Ex3
#
# [[2]]
# data1 data2
# 1 56 59
# 2 78 27
# 3 3 2
For the second part of the question, you can also subset directly by column name:
lapply(my_list, "[", "sample")
# [[1]]
# sample
# 1 2
# 2 6
#
# [[2]]
# sample
# 1 12
# 2 13
# 3 17
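If you need several columns by name and not every data frame in the list has all of them, a sketch using intersect() avoids out-of-bounds errors (the wanted vector and the simplified list below are my own example):

```r
my_list <- list(
  data.frame(sample = c(2L, 6L), data1 = c(56L, 78L), data3 = c(90L, 28L)),
  data.frame(sample = c(12L, 13L, 17L), data1 = c(56L, 78L, 3L))
)

wanted <- c("sample", "data3")
# keep only the requested columns that actually exist in each data frame
res <- lapply(my_list, function(df) df[intersect(wanted, names(df))])
# the first element keeps sample and data3; the second has no data3, so only sample
```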