R: Extract list columns based on column names and patterns

I have a list (here only sample data)
my_list <- list(structure(list(sample = c(2L, 6L), data1 = c(56L, 78L),
data2 = c(59L, 27L), data3 = c(90L, 28L), data1namet = structure(c(1L,
1L), .Label = "Sam1", class = "factor"), data2namab = structure(c(1L,
1L), .Label = "Test2", class = "factor"), dataame = structure(c(1L,
1L), .Label = "Ex3", class = "factor"), ma = c("Jay", "Jay"
)), .Names = c("sample", "data1", "data2", "data3", "data1namet",
"data2namab", "dataame", "ma"), row.names = c(NA, -2L), class = "data.frame"),
structure(list(sample = c(12L, 13L, 17L), data1 = c(56L,
78L, 3L), data2 = c(59L, 27L, 2L), datest = structure(c(1L,
1L, 1L), .Label = "Exa9", class = "factor"), dattestr = structure(c(1L,
1L, 1L), .Label = "cz1", class = "factor"), add = c(2, 2,
2)), .Names = c("sample", "data1", "data2", "datest", "dattestr",
"add"), row.names = c(NA, -3L), class = "data.frame"))
my_list
[[1]]
sample data1 data2 data3 data1namet data2namab dataame ma
1 2 56 59 90 Sam1 Test2 Ex3 Jay
2 6 78 27 28 Sam1 Test2 Ex3 Jay
[[2]]
sample data1 data2 datest dattestr add
1 12 56 59 Exa9 cz1 2
2 13 78 27 Exa9 cz1 2
3 17 3 2 Exa9 cz1 2
I've got two problems:
I would like to extract columns in this list based on patterns of their column names, e.g. all columns which contain the word 'data' in their column name. I wasn't able to find a solution with grep.
I know how to extract one column based on their index number (see example below), but how could I do this selection directly based on the column name (not the column number)?
out <- lapply(my_list, `[`, 1) # extract "sample" column

Try
lapply(my_list, function(df) df[, grep("data", names(df), fixed = TRUE)] )
# [[1]]
# data1 data2 data3 data1namet data2namab dataame
# 1 56 59 90 Sam1 Test2 Ex3
# 2 78 27 28 Sam1 Test2 Ex3
#
# [[2]]
# data1 data2
# 1 56 59
# 2 78 27
# 3 3 2
lapply(my_list, "[", "sample")
# [[1]]
# sample
# 1 2
# 2 6
#
# [[2]]
# sample
# 1 12
# 2 13
# 3 17
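A minimal sketch of both selections, using a cut-down stand-in for my_list (the data below is made up for illustration). It assumes a regular expression is acceptable for the pattern and adds drop = FALSE so that a single matching column still comes back as a data frame rather than a bare vector.

```r
# Cut-down stand-in for my_list (illustrative data only)
my_list <- list(
  data.frame(sample = c(2L, 6L), data1 = c(56L, 78L), data2 = c(59L, 27L),
             data1namet = c("Sam1", "Sam1"), ma = c("Jay", "Jay")),
  data.frame(sample = c(12L, 13L, 17L), data1 = c(56L, 78L, 3L),
             datest = c("Exa9", "Exa9", "Exa9"))
)

# Pattern-based selection: only columns named "data" followed by digits.
# drop = FALSE keeps the result a data frame even when one column matches.
by_pattern <- lapply(my_list, function(df) {
  df[, grepl("^data[0-9]+$", names(df)), drop = FALSE]
})

# Name-based selection: pass the column names instead of index numbers.
by_name <- lapply(my_list, function(df) df[, c("sample", "data1"), drop = FALSE])
```

Using grepl("data", names(df), fixed = TRUE) instead would match every column whose name contains "data", as in the answer above.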

Related

Adding multiple dataframes with same column names based on specific column values in R

I have multiple dataframes with identical column names and dimensions:
df1
device_id price tax
1 a 200 5
2 b 100 2
3 c 50 1
df2
device_id price tax
1 b 200 7
2 a 100 3
3 c 50 1
df3
device_id price tax
1 c 50 5
2 b 300 1
3 a 50 2
What I want to do is to create another dataframe df where I will add the price and tax values from the above three dataframes with matching device_ids.
So,
df would be like
df
device_id price tax
1 a 350 10
2 b 600 10
3 c 150 7
How can I do it? Also, it would be great if the solution could be generalized to a larger number of dataframes instead of just 3.
First, get all your data frames into a list (called dflist here, defined below). Then it's easy to do with aggregate() after row-binding the list elements.
aggregate(. ~ device_id, do.call(rbind, dflist), sum)
# device_id price tax
# 1 a 350 10
# 2 b 600 10
# 3 c 150 7
Or you could use the data.table package.
library(data.table)
rbindlist(dflist)[, lapply(.SD, sum), by = device_id]
# device_id price tax
# 1: a 350 10
# 2: b 600 10
# 3: c 150 7
Or dplyr.
library(dplyr)
bind_rows(dflist) %>%
group_by(device_id) %>%
summarise(across(everything(), sum)) # across() replaces the deprecated summarize_each(funs(sum)); needs dplyr >= 1.0
# Source: local data frame [3 x 3]
#
# device_id price tax
# <fctr> <int> <int>
# 1 a 350 10
# 2 b 600 10
# 3 c 150 7
Data:
dflist <- structure(list(df1 = structure(list(device_id = structure(1:3, .Label = c("a",
"b", "c"), class = "factor"), price = c(200L, 100L, 50L), tax = c(5L,
2L, 1L)), .Names = c("device_id", "price", "tax"), class = "data.frame", row.names = c("1",
"2", "3")), df2 = structure(list(device_id = structure(c(2L,
1L, 3L), .Label = c("a", "b", "c"), class = "factor"), price = c(200L,
100L, 50L), tax = c(7L, 3L, 1L)), .Names = c("device_id", "price",
"tax"), class = "data.frame", row.names = c("1", "2", "3")),
df3 = structure(list(device_id = structure(c(3L, 2L, 1L), .Label = c("a",
"b", "c"), class = "factor"), price = c(50L, 300L, 50L),
tax = c(5L, 1L, 2L)), .Names = c("device_id", "price",
"tax"), class = "data.frame", row.names = c("1", "2", "3"
))), .Names = c("df1", "df2", "df3"))
We can use by() from base R: first place all the data.frame objects in a list with mget(paste0("df", 1:3)), rbind them, and then sum the columns within each device_id group.
dfN <- do.call(rbind, mget(paste0("df", 1:3)))
do.call(rbind, by(dfN[-1], dfN[1], FUN = colSums))
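Another base R option, sketched under the assumption that every data frame has the same device_id, price and tax columns: stack the rows and let rowsum() add them up per group. The data frames are rebuilt here with data.frame() so the example runs on its own, and the list can hold any number of them.

```r
# Rebuild the three example data frames (values taken from the question)
df1 <- data.frame(device_id = c("a", "b", "c"), price = c(200L, 100L, 50L), tax = c(5L, 2L, 1L))
df2 <- data.frame(device_id = c("b", "a", "c"), price = c(200L, 100L, 50L), tax = c(7L, 3L, 1L))
df3 <- data.frame(device_id = c("c", "b", "a"), price = c(50L, 300L, 50L), tax = c(5L, 1L, 2L))
dflist <- list(df1, df2, df3)   # works for any number of data frames

# Stack all rows, then sum the numeric columns within each device_id
all_rows <- do.call(rbind, dflist)
totals <- rowsum(all_rows[c("price", "tax")], group = all_rows$device_id)

# rowsum() stores the group labels as row names; move them back into a column
result <- data.frame(device_id = rownames(totals), totals, row.names = NULL)
```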

How to intersect values from two data frames with R

I would like to create a new column for a data frame with values from the intersection of a row and a column.
I have a data.frame called "time":
q 1 2 3 4 5
a 1 13 43 5 3
b 2 21 12 3353 34
c 3 21 312 123 343
d 4 123 213 123 35
e 4556 11 123 12 3
And another table, called "event":
q dt
a 1
b 3
c 4
d 2
e 1
I want to add another column called inter to the second table, filled with the values at the intersection of each row's q and the column named by dt in the first data.frame. So the result would be this:
q dt inter
a 1 1
b 3 12
c 4 123
d 2 123
e 1 4556
I have tried merge(event, time, by.x = "q", by.y = "dt"), but it raises an error saying they aren't the same id. I also tried transposing the time data.frame to cross-reference the values, but without success.
library(reshape2)
merge(event, melt(time, id.vars = "q"),
by.x=c('q','dt'), by.y=c('q','variable'), all.x = TRUE)
Output:
q dt value
1 a 1 1
2 b 3 12
3 c 4 123
4 d 2 123
5 e 1 4556
Notes
We use the function melt from the package reshape2 to convert the data frame time from wide to long format. Then we merge (left outer join) the data frames event and the melted time on two columns (q and dt in event, q and variable in the melted time).
Data:
time <- structure(list(q = structure(1:5, .Label = c("a", "b", "c", "d",
"e"), class = "factor"), `1` = c(1L, 2L, 3L, 4L, 4556L), `2` = c(13L,
21L, 21L, 123L, 11L), `3` = c(43L, 12L, 312L, 213L, 123L), `4` = c(5L,
3353L, 123L, 123L, 12L), `5` = c(3L, 34L, 343L, 35L, 3L)), .Names = c("q",
"1", "2", "3", "4", "5"), class = "data.frame", row.names = c(NA,
-5L))
event <- structure(list(q = structure(1:5, .Label = c("a", "b", "c", "d",
"e"), class = "factor"), dt = c(1L, 3L, 4L, 2L, 1L)), .Names = c("q",
"dt"), class = "data.frame", row.names = c(NA, -5L))
This may be a little clunky but it works:
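xx=merge(time,event,by='q') # merge once, outside the loop
dt=xx$dt
inter=c()
for (i in 1:nrow(xx)) {
# the value sits in column dt[i]+1 because the first column of xx is q
z=xx[i,dt[i]+1]
inter=c(inter,z)
}
final=cbind(xx[,1],dt,inter) # q is a factor, so cbind() stores its integer codes
colnames(final)=c('q','dt','inter')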
Hope it helps.
Output:
q dt inter
[1,] 1 1 1
[2,] 2 3 12
[3,] 3 4 123
[4,] 4 2 123
[5,] 5 1 4556
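The same lookup can also be sketched in base R without a loop or a reshape, by indexing a matrix with (row, column) pairs. This assumes every q in event occurs in time and every dt corresponds to a column name of time; the data is rebuilt with data.frame() so the example is self-contained.

```r
# Rebuild the example data; check.names = FALSE keeps the numeric column names
time <- data.frame(q = c("a", "b", "c", "d", "e"),
                   `1` = c(1L, 2L, 3L, 4L, 4556L),
                   `2` = c(13L, 21L, 21L, 123L, 11L),
                   `3` = c(43L, 12L, 312L, 213L, 123L),
                   `4` = c(5L, 3353L, 123L, 123L, 12L),
                   `5` = c(3L, 34L, 343L, 35L, 3L),
                   check.names = FALSE)
event <- data.frame(q = c("a", "b", "c", "d", "e"), dt = c(1L, 3L, 4L, 2L, 1L))

# Value columns as a matrix, so one (row, column) pair picks one cell
vals <- as.matrix(time[-1])
row_idx <- match(event$q, time$q)                         # row of each q
col_idx <- match(as.character(event$dt), colnames(vals))  # column named by dt

# Indexing with a two-column matrix returns one value per (row, column) pair
event$inter <- vals[cbind(row_idx, col_idx)]
```

Unlike the loop, this keeps q as the original labels and does not depend on the two tables being in the same row order.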

Aggregate while subtracting values in R

If I have two data frames in R (let's call them df1 and df2 respectively) such as
> df1
state num1
AL 22
AK 49
AZ 48
AR 25
and
> df2
state num2
AK 2
AZ 3
AR 4
CA 5
how do I aggregate those data frames while subtracting the values to form something like
state num3
AL 22
AK 47
AZ 45
AR 21
CA -5
Note: the key values are not the same in the data frames; the data frames have different numbers of rows
There may be an easier way to get there, but here's a possibility. We can merge() the two data frames, then subtract the columns after replacing the NA values with zero.
m <- merge(df1, df2, all = TRUE)
cbind(m[1], num3 = with(replace(m, is.na(m), 0L), num1 - num2))
# state num3
# 1 AK 47
# 2 AL 22
# 3 AR 21
# 4 AZ 45
# 5 CA -5
Data:
df1 <- structure(list(state = structure(c(2L, 1L, 4L, 3L), .Label = c("AK",
"AL", "AR", "AZ"), class = "factor"), num1 = c(22L, 49L, 48L,
25L)), .Names = c("state", "num1"), row.names = c(NA, 4L), class = "data.frame")
df2 <- structure(list(state = structure(c(1L, 3L, 2L, 4L), .Label = c("AK",
"AR", "AZ", "CA"), class = "factor"), num2 = 2:5), .Names = c("state",
"num2"), row.names = 2:5, class = "data.frame")
One way with dplyr would be the following. You combine the two data frames with full_join(). Then, you replace NA with 0. Then, you handle the subtraction, which is done in the mutate() part. Finally, choose the necessary columns with select().
DATA
mydf1 <- structure(list(state = structure(c(2L, 1L, 4L, 3L), .Label = c("AK",
"AL", "AR", "AZ"), class = "factor"), num1 = c(22L, 49L, 48L,
25L)), .Names = c("state", "num1"), class = "data.frame", row.names = c(NA,
-4L))
mydf2 <- structure(list(state = structure(c(1L, 3L, 2L, 4L), .Label = c("AK",
"AR", "AZ", "CA"), class = "factor"), num2 = 2:5), .Names = c("state",
"num2"), class = "data.frame", row.names = c(NA, -4L))
CODE
library(dplyr)
full_join(mydf1, mydf2, by = "state") %>%
mutate(across(num1:num2, ~ replace(., is.na(.), 0))) %>% # across() replaces the deprecated mutate_each()/funs()
mutate(num3 = num1 - num2) %>%
select(state, num3)
# state num3
#1 AL 22
#2 AK 47
#3 AZ 45
#4 AR 21
#5 CA -5
Instead of merging the data frames, you can combine their rows. First we change the sign of the column num2 and then we aggregate the results by state:
Base package:
aggregate(num1 ~ state,
data = rbind(df1, setNames(data.frame(df2[1], -df2[2]), names(df1))),
FUN = sum)
Output:
state num1
1 AK 47
2 AL 22
3 AR 21
4 AZ 45
5 CA -5
dplyr:
library(dplyr)
rbind(df1, setNames(data.frame(df2[1], -df2[2]), names(df1))) %>%
group_by(state) %>%
summarise(sum = sum(num1))
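Several of the answers above return the rows sorted by state. If the order shown in the question (AL, AK, AZ, AR, CA) matters, one possible base R sketch is to look both columns up against the union of the states with match(). The data frames are rebuilt here with plain character columns, which is an assumption; the originals use factors.

```r
# Rebuild the example data with character state columns
df1 <- data.frame(state = c("AL", "AK", "AZ", "AR"), num1 = c(22L, 49L, 48L, 25L))
df2 <- data.frame(state = c("AK", "AZ", "AR", "CA"), num2 = 2:5)

# All states, in order of first appearance (df1 first, then new ones from df2)
states <- union(df1$state, df2$state)

# Look up each state's value; states missing from a table give NA
num1 <- df1$num1[match(states, df1$state)]
num2 <- df2$num2[match(states, df2$state)]

# Treat missing values as zero before subtracting
num1[is.na(num1)] <- 0L
num2[is.na(num2)] <- 0L

res <- data.frame(state = states, num3 = num1 - num2)
```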

Extract data.frame column names from rows in data.frame

I have a couple of data.frames which have approximately the same structure. For a reproducible example I created two sample dataframes df1 and df2.
df1 <- structure(list(sample = c(2L, 6L), data1 = c(56L, 78L), data2 = c(59L,
27L), data3 = c(90L, 28L), data1namet = structure(c(1L, 1L), .Label = "Sam1", class = "factor"),
data2namab = structure(c(1L, 1L), .Label = "Test2", class = "factor"),
dataame = structure(c(1L, 1L), .Label = "Ex3", class = "factor")), .Names = c("sample",
"data1", "data2", "data3", "data1namet", "data2namab", "dataame"
), class = "data.frame", row.names = c(NA, -2L))
df1
sample data1 data2 data3 data1namet data2namab dataame
1 2 56 59 90 Sam1 Test2 Ex3
2 6 78 27 28 Sam1 Test2 Ex3
df2 <- structure(list(sample = c(12L, 13L, 17L), data1 = c(56L, 78L,
3L), data2 = c(59L, 27L, 2L), datest = structure(c(1L, 1L,
1L), .Label = "Exa9", class = "factor"), dattestr = structure(c(1L,
1L, 1L), .Label = "cz1", class = "factor")), .Names = c("sample",
"data1", "data2", "datest", "dattestr"), class = "data.frame", row.names = c(NA,
-3L))
df2
sample data1 data2 datest dattestr
1 12 56 59 Exa9 cz1
2 13 78 27 Exa9 cz1
3 17 3 2 Exa9 cz1
The name of each data series is stored in the columns after the data columns. Is there a way to restructure the data.frames (about 40 of them) so that the data columns carry these names as their column names?
df1
sample Sam1 Test2 Ex3
1 2 56 59 90
2 6 78 27 28
and
df2
sample Exa9 cz1
1 12 56 59
2 13 78 27
3 17 3 2
EDIT
As I just realised I also have other columns after the data columns so that my input data looks like this
df1 <- structure(list(sample = c(2L, 6L), data1 = c(56L, 78L), data2 = c(59L,
27L), data3 = c(90L, 28L), data1namet = structure(c(1L, 1L), .Label = "Sam1", class = "factor"),
data2namab = structure(c(1L, 1L), .Label = "Test2", class = "factor"),
dataame = structure(c(1L, 1L), .Label = "Ex3", class = "factor"),
ma = c("Jay", "Jay")), .Names = c("sample", "data1", "data2",
"data3", "data1namet", "data2namab", "dataame", "ma"), row.names = c(NA,
-2L), class = "data.frame")
df1
sample data1 data2 data3 data1namet data2namab dataame ma
1 2 56 59 90 Sam1 Test2 Ex3 Jay
2 6 78 27 28 Sam1 Test2 Ex3 Jay
df2 <- structure(list(sample = c(12L, 13L, 17L), data1 = c(56L, 78L,
3L), data2 = c(59L, 27L, 2L), datest = structure(c(1L, 1L, 1L
), .Label = "Exa9", class = "factor"), dattestr = structure(c(1L,
1L, 1L), .Label = "cz1", class = "factor"), add = c(2, 2, 2)), .Names = c("sample",
"data1", "data2", "datest", "dattestr", "add"), row.names = c(NA,
-3L), class = "data.frame")
df2
sample data1 data2 datest dattestr add
1 12 56 59 Exa9 cz1 2
2 13 78 27 Exa9 cz1 2
3 17 3 2 Exa9 cz1 2
In this case the ma and add column are not part of the data and should be added at the end like this:
df1
sample Sam1 Test2 Ex3 ma
1 2 56 59 90 Jay
2 6 78 27 28 Jay
and
df2
sample Exa9 cz1 add
1 12 56 59 2
2 13 78 27 2
3 17 3 2 2
One could start by identifying which columns should be kept:
keep_col <- which(sapply(df2, is.numeric))
After that, some work is required to extract the new column names and to rename the corresponding columns in the data frame:
names <- df2[1,keep_col[-1] + length(keep_col)-1]
colnames(df2)[keep_col[-1]] <- as.character(unlist(names))
Finally, the dataframe can be reassembled by keeping only the desired columns:
df2 <- df2[,keep_col]
#> df2
# sample Exa9 cz1
#1 12 56 59
#2 13 78 27
#3 17 3 2
In order to use this transformation for several different dataframes, the code can be wrapped into a function:
summarize_table <- function(x){
keep_col <- which(sapply(x, is.numeric))
names <- x[1,keep_col[-1] + length(keep_col)-1]
colnames(x)[keep_col[-1]] <- as.character(unlist(names))
x[,keep_col] # return the reduced data frame visibly
}
If the various dataframes are stored in a list, the function summarize_table() can be used with lapply() to obtain the results for each dataframe:
my_dfs <- list(df1,df2)
out <- lapply(my_dfs,summarize_table)
#> out
#[[1]]
# sample Sam1 Test2 Ex3
#1 2 56 59 90
#2 6 78 27 28
#
#[[2]]
# sample Exa9 cz1
#1 12 56 59
#2 13 78 27
#3 17 3 2
EDIT / ADDENDUM
The modified version below should be able to handle also the cases mentioned in the revised post:
summarize_tab2 <- function(x){
keep_col <- which(sapply(x, is.numeric))
first_block <- c(keep_col[1],keep_col[which(diff(keep_col)==1)])
add_col <- FALSE
if (2 * (length(keep_col) - 1) + 1 < ncol(x)) add_col <- TRUE
keep_col1 <- keep_col[1:length(first_block)]
names <- x[1,keep_col1[-1] + length(keep_col1) - 1]
colnames(x)[keep_col1[-1]] <- as.character(unlist(names))
df_t <- x[,keep_col]
if (add_col) df_t <- cbind(df_t, x[(2 * (ncol(df_t) - 1) + 2):ncol(x)])
return(df_t)
}
my_dfs <- list(df1, df2, df3, df4)
out <- lapply(my_dfs, summarize_tab2)
#> out
#[[1]]
# sample Sam1 Test2 Ex3 ma
#1 2 56 59 90 Jay
#2 6 78 27 28 Jay
#
#[[2]]
# sample Exa9 cz1 add
#1 12 56 59 2
#2 13 78 27 2
#3 17 3 2 2
#
#[[3]]
# sample Sam1 Test2 Ex3
#1 2 56 59 90
#2 6 78 27 28
#
#[[4]]
# sample Exa9 cz1
#1 12 56 59
#2 13 78 27
#3 17 3 2
Here the dataframes df3 and df4 are, respectively, the data frames df1 and df2 of the original post.
The following should work:
library(plyr)
cols.to.rename <- grep('^data(.)$', colnames(df1))
cols.of.names <- max(cols.to.rename)+seq(1,length(cols.to.rename))
the.names <- lapply(df1[1,cols.of.names], as.character)
df1.mod <- df1
colnames(df1.mod)[cols.to.rename] <- the.names
df1.mod <- df1.mod[-cols.of.names]
It renames all dataX columns to the (first) value in the columns following the last dataX column. It then drops all name columns from the data frame.
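To apply the same idea to many data frames at once, it could be wrapped in a function and used with lapply(). The sketch below makes the same assumptions as the answer above: the data columns are named data followed by digits, and the name columns follow directly after the last data column, one per data column. The function name rename_from_name_cols is made up for this example, and the data frames are rebuilt with data.frame() so the block runs on its own.

```r
# Example inputs matching the edited question (with the extra ma / add columns)
df1 <- data.frame(sample = c(2L, 6L), data1 = c(56L, 78L), data2 = c(59L, 27L),
                  data3 = c(90L, 28L), data1namet = "Sam1", data2namab = "Test2",
                  dataame = "Ex3", ma = "Jay")
df2 <- data.frame(sample = c(12L, 13L, 17L), data1 = c(56L, 78L, 3L),
                  data2 = c(59L, 27L, 2L), datest = "Exa9", dattestr = "cz1",
                  add = c(2, 2, 2))

# Hypothetical helper: rename dataN columns from the name columns, then drop those
rename_from_name_cols <- function(df) {
  data_cols <- grep("^data[0-9]+$", names(df))         # positions of data1, data2, ...
  name_cols <- max(data_cols) + seq_along(data_cols)   # name columns directly after them
  # The first row of each name column holds the new column name
  new_names <- vapply(df[1, name_cols, drop = FALSE], as.character, character(1))
  names(df)[data_cols] <- unname(new_names)
  df[-name_cols]                                       # keep everything except the name columns
}

out <- lapply(list(df1, df2), rename_from_name_cols)
```

Because only the name columns are dropped, trailing columns such as ma and add stay in place.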

R: Compare and filtering of two data.frames based on conditions

I have a data.frame data_qual that looks like this:
data_qual <- structure(list(NAME = structure(1:3, .Label = c("NAME1", "NAME2", "NAME3"), class = "factor"), ID = c(56L, 47L, 77L), YEAR = c(1990L, 2007L, 1899L), VALUE = structure(c(2L, 1L, 1L), .Label = c("ST", "X"), class = "factor")), .Names = c("NAME", "ID", "YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -3L))
NAME ID YEAR VALUE
1 NAME1 56 1990 X
2 NAME2 47 2007 ST
3 NAME3 77 1899 ST
I'd like to filter out values from data_qual by comparing it to another dataframe dat:
dat <- structure(list(NAME = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("NAME1","NAME2"), class = "factor"), ID = c(56L, 56L, 56L, 47L, 47L, 47L, 47L), YEAR = c(1988L, 1989L, 1991L, 2005L, 2006L, 2007L, 2008L), VALUE = c(45L, 28L, 28L, -12L, 14L, 23L, 32L)), .Names = c("NAME", "ID", "YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -7L))
NAME ID YEAR VALUE
1 NAME1 56 1988 45
2 NAME1 56 1989 28
3 NAME1 56 1991 28
4 NAME2 47 2005 -12
5 NAME2 47 2006 14
6 NAME2 47 2007 23
7 NAME2 47 2008 32
How could I filter data_qual based on the column ID so that in a first filtering process only rows are written to a new data.frame that have a matching ID with dat?
NAME ID YEAR VALUE
1 NAME1 56 1990 X
2 NAME2 47 2007 ST
Then after that I am looking for a way that from the resulting data.frame only rows should be written out that don't have the same YEAR per group (as defined by ID)
NAME ID YEAR VALUE
1 NAME1 56 1990 X
Any help is kindly appreciated.
For the first part
dat2 <- data_qual[data_qual$ID %in% dat$ID, ]
dat2
NAME ID YEAR VALUE
1 NAME1 56 1990 X
2 NAME2 47 2007 ST
And then for the second part
good_rows <- lapply(paste(dat2$ID, dat2$YEAR, sep = ":"), grepl, x = paste(dat$ID, dat$YEAR, sep = ":"))
dat3 <- dat2[!unlist(lapply(good_rows, any)), ]
Or if that's too messy for you, a for loop
good_rows <- vector(length = nrow(dat2))
for (i in 1:nrow(dat2)) {
good_rows[i] <- !any(grepl(dat2$YEAR[i], dat[dat$ID == dat2$ID[i], "YEAR"]))
}
dat3 <- dat2[good_rows, ]
dat3
NAME ID YEAR VALUE
1 NAME1 56 1990 X
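Both versions above rely on grepl(), which matches substrings: a key such as 6:1990 would also match 56:1990. A sketch of the same two-step filter with exact matching via %in% follows; the helper key() is made up for this example, and the data is rebuilt with data.frame() so the block is self-contained.

```r
# Rebuild the example data
data_qual <- data.frame(NAME = c("NAME1", "NAME2", "NAME3"),
                        ID = c(56L, 47L, 77L),
                        YEAR = c(1990L, 2007L, 1899L),
                        VALUE = c("X", "ST", "ST"))
dat <- data.frame(NAME = rep(c("NAME1", "NAME2"), times = c(3, 4)),
                  ID = rep(c(56L, 47L), times = c(3, 4)),
                  YEAR = c(1988L, 1989L, 1991L, 2005L, 2006L, 2007L, 2008L),
                  VALUE = c(45L, 28L, 28L, -12L, 14L, 23L, 32L))

# Step 1: keep only rows whose ID also occurs in dat
dat2 <- data_qual[data_qual$ID %in% dat$ID, ]

# Step 2: drop rows whose (ID, YEAR) pair occurs in dat.
# key() is a hypothetical helper that builds one "ID:YEAR" string per row.
key <- function(d) paste(d$ID, d$YEAR, sep = ":")
dat3 <- dat2[!(key(dat2) %in% key(dat)), ]
```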
