Pass variable as column name to dplyr?

I have a very ugly dataset that is a flat file of a relational database. A minimal reproducible example is here:
df <- data.frame(col1 = c(letters[1:4], "c"),
                 col1.p = 1:5,
                 col2 = c("a", "c", "l", "c", "l"),
                 col2.p = 6:10,
                 col3 = letters[3:7],
                 col3.p = 11:20)  # the length-5 columns recycle against 11:20, giving 10 rows
I need to be able to identify the '.p' value for the 'col#' that has the "c". My previous question on SO, In R, find the column that contains a string for each row, got the first part; I'm providing it for context.
tmp <- which(projectdata == 'Transmission and Distribution of Electricity', arr.ind = TRUE)
cnt <- ave(tmp[, "row"], tmp[, "row"], FUN = seq_along)  # running count of matches within each row
maxnames <- paste0("max", sequence(max(cnt)))            # max1, max2, ... as needed
projectdata[maxnames] <- NA
projectdata[maxnames][cbind(tmp[, "row"], cnt)] <- names(projectdata)[tmp[, "col"]]
rm(tmp, cnt, maxnames)
This results in a dataframe that looks like this:
df
col1 col1.p col2 col2.p col3 col3.p max1
1 a 1 a 6 c 11 col3
2 b 2 c 7 d 12 col2
3 c 3 l 8 e 13 col1
4 d 4 c 9 f 14 col2
5 c 5 l 10 g 15 col1
6 a 1 a 6 c 16 col3
7 b 2 c 7 d 17 col2
8 c 3 l 8 e 18 col1
9 d 4 c 9 f 19 col2
10 c 5 l 10 g 20 col1
When I tried to get the ".p" that matched the value in "max1", I kept getting errors. I thought the approach would be:
df %>%
  mutate(my.p = eval(as.name(paste0(max1, '.p'))))
Error: object 'col3.p' not found
Clearly, this did not work, so I thought maybe this was similar to passing a column name in a function, where I need to use 'get'. That also didn't work.
df %>%
  mutate(my.p = get(as.name(paste0(max1, '.p'))))
Error: invalid first argument
df %>%
  mutate(my.p = get(paste0(max1, '.p')))
Error: object 'col3.p' not found
I found something that gets rid of this error, using data.table, in a different but related problem: http://codereply.com/answer/7y2ra3/dplyr-error-object-found-using-rle-mutate.html. However, it gives me the value of "col3.p" for every row — "col3" being max1 for the first row, df$max1[1].
library('dplyr')
library('data.table') # must have the data.table package
df %>%
  tbl_dt(df) %>%
  mutate(my.p = get(paste0(max1, '.p')))
Source: local data table [10 x 8]
col1 col1.p col2 col2.p col3 col3.p max1 my.p
1 a 1 a 6 c 11 col3 11
2 b 2 c 7 d 12 col2 12
3 c 3 l 8 e 13 col1 13
4 d 4 c 9 f 14 col2 14
5 c 5 l 10 g 15 col1 15
6 a 1 a 6 c 16 col3 16
7 b 2 c 7 d 17 col2 17
8 c 3 l 8 e 18 col1 18
9 d 4 c 9 f 19 col2 19
10 c 5 l 10 g 20 col1 20
Using the lazyeval interp approach (from this SO: How to pass dynamic column names in dplyr into custom function?) doesn't work for me. Perhaps I am implementing it incorrectly?
library(lazyeval)
library(dplyr)
df %>%
  mutate_(my.p = interp(~colp, colp = as.name(paste0(max1, '.p'))))
I get an error:
Error in paste0(max1, ".p") : object 'max1' not found
Ideally, I will have the new column my.p equal the appropriate p based on the column identified in max1.
I can do this all with ifelse, but I am trying to do it with less code and to make it applicable to the next ugly flat table.

We can do this with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by the row sequence, we get the value of the column named by the paste output and assign (:=) it to a new column ('my.p').
library(data.table)
setDT(df)[, my.p:= get(paste0(max1, '.p')), 1:nrow(df)]
df
# col1 col1.p col2 col2.p col3 col3.p max1 my.p
# 1: a 1 a 6 c 11 col3 11
# 2: b 2 c 7 d 12 col2 7
# 3: c 3 l 8 e 13 col1 3
# 4: d 4 c 9 f 14 col2 9
# 5: c 5 l 10 g 15 col1 5
# 6: a 1 a 6 c 16 col3 16
# 7: b 2 c 7 d 17 col2 7
# 8: c 3 l 8 e 18 col1 3
# 9: d 4 c 9 f 19 col2 9
#10: c 5 l 10 g 20 col1 5
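For comparison, a dplyr-only sketch — assuming a reasonably recent dplyr, where rowwise() makes mutate() evaluate one row at a time, so get() sees scalar columns:
library(dplyr)
df %>%
  rowwise() %>%                               # evaluate mutate() per row
  mutate(my.p = get(paste0(max1, ".p"))) %>%  # look up e.g. "col1.p" for this row
  ungroup()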

Related

Strsplit on a column of a data frame [duplicate]

I have a data.frame where one of the variables is a vector (or a list), like this:
MyColumn <- c("A, B,C", "D,E", "F", "G")
MyDF <- data.frame(group_id = 1:4, val = 11:14, cat = MyColumn)
# group_id val cat
# 1 1 11 A, B,C
# 2 2 12 D,E
# 3 3 13 F
# 4 4 14 G
I'd like to have a new data frame with as many rows as the vector
FlatColumn <- unlist(strsplit(MyColumn,split=","))
which looks like this:
MyNewDF <- data.frame(group_id = c(rep(1, 3), rep(2, 2), 3, 4),
                      val = c(rep(11, 3), rep(12, 2), 13, 14),
                      cat = FlatColumn)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G
In essence, for every letter that appears as an element of MyColumn (the letters A to G), I want to assign the corresponding values from the other columns. Every letter appears only once in MyColumn.
Is there a neat way to do this kind of reshaping/unlisting/merging? I've come up with a very cumbersome for-loop over the rows of MyDF and the lengths of the corresponding elements of strsplit(MyColumn, split = ","). I'm sure there has to be a more elegant way.
You can use separate_rows from tidyr:
tidyr::separate_rows(MyDF, cat)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G
How about
lst <- strsplit(MyColumn, split = ",")
k <- lengths(lst) ## expansion size
FlatColumn <- unlist(lst, use.names = FALSE)
MyNewDF <- data.frame(group_id = rep.int(MyDF$group_id, k),
                      val = rep.int(MyDF$val, k),
                      cat = FlatColumn)
# group_id val cat
#1 1 11 A
#2 1 11 B
#3 1 11 C
#4 2 12 D
#5 2 12 E
#6 3 13 F
#7 4 14 G
We can use cSplit from splitstackshape
library(splitstackshape)
cSplit(MyDF, "cat", ",", "long")
# group_id val cat
#1: 1 11 A
#2: 1 11 B
#3: 1 11 C
#4: 2 12 D
#5: 2 12 E
#6: 3 13 F
#7: 4 14 G
We can also do this in base R with strsplit: split the 'cat' column into a list ('lst'), replicate the rows of 'MyDF' according to the lengths of 'lst', and create the 'cat' column by unlisting 'lst'.
lst <- strsplit(as.character(MyDF$cat), ",")
transform(MyDF[rep(1:nrow(MyDF), lengths(lst)),-3], cat = unlist(lst))

How to remove individuals with fewer than 5 observations from a data frame [duplicate]

To clarify the question, I'll briefly describe the data.
Each row in the data.frame is an observation, and the columns represent variables pertinent to that observation, including which individual was observed, when it was observed, where it was observed, etc. I want to exclude/filter individuals for which there are fewer than 5 observations.
In other words, if there are fewer than 5 rows where individual = x, then I want to remove all rows that contain individual x and reassign the result to a new data.frame. I'm aware of some brute-force techniques using something like names == unique(df$individualname), subsetting out those names individually, and applying nrow to decide whether or not to exclude them (sketched below)... but there has to be a better way. Any help is appreciated; I'm still pretty new to R.
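Roughly, the cumbersome approach just described would look like this (a sketch with hypothetical column names):
# brute force: count rows per individual, keep those with >= 5 observations
keep <- character(0)
for (nm in unique(df$individualname)) {
  if (nrow(df[df$individualname == nm, ]) >= 5) keep <- c(keep, nm)
}
newdf <- df[df$individualname %in% keep, ]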
An example using group_by and filter from the dplyr package:
library(dplyr)
df <- data.frame(id = c(rep("a", 2), rep("b", 5), rep("c", 8)),
                 foo = runif(15))
> df
id foo
1 a 0.8717067
2 a 0.9086262
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
df %>% group_by(id) %>% filter(n() >= 5) %>% ungroup()
Source: local data frame [13 x 2]
id foo
(fctr) (dbl)
1 b 0.9962453
2 b 0.8980123
3 b 0.1535324
4 b 0.2802848
5 b 0.9366375
6 c 0.8109557
7 c 0.6945285
8 c 0.1012925
9 c 0.6822955
10 c 0.3757085
11 c 0.7348635
12 c 0.3026395
13 c 0.9707223
or with base R:
> df[df$id %in% names(which(table(df$id)>=5)), ]
id foo
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
Still in base R, using with is a more elegant way to do the very same thing:
df[with(df, id %in% names(which(table(id)>=5))), ]
or:
subset(df, with(df, id %in% names(which(table(id)>=5))))
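Along the same lines, a base R sketch using ave to compute the per-group row counts directly:
df[ave(seq_len(nrow(df)), df$id, FUN = length) >= 5, ]  # group sizes, recycled per row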
Another way to do the same thing, using the data.table package:
library(data.table)
set.seed(1)
dt <- data.table(id=sample(1:4,20,replace=TRUE),var=sample(1:100,20))
dt1 <- dt[, count := .N, by = id][(count >= 5)]
dt2 <- dt[, count := .N, by = id][(count < 5)]
dt1
id var count
1: 2 94 5
2: 2 22 5
3: 3 64 5
4: 4 13 6
5: 4 37 6
6: 4 2 6
7: 3 36 5
8: 3 81 5
9: 3 90 5
10: 2 17 5
11: 4 72 6
12: 2 57 5
13: 3 67 5
14: 4 9 6
15: 2 60 5
16: 4 34 6
dt2
id var count
1: 1 26 4
2: 1 31 4
3: 1 44 4
4: 1 54 4
It can also be done with data.table, using a logical condition with if after grouping by 'id':
library(data.table)
setDT(df)[, if(.N >=5) .SD, id]
# id foo
# 1: b 0.9962453
# 2: b 0.8980123
# 3: b 0.1535324
# 4: b 0.2802848
# 5: b 0.9366375
# 6: c 0.8109557
# 7: c 0.6945285
# 8: c 0.1012925
# 9: c 0.6822955
#10: c 0.3757085
#11: c 0.7348635
#12: c 0.3026395
#13: c 0.9707223
data
df <- structure(list(id = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "c"), foo = c(0.8717067, 0.9086262,
0.9962453, 0.8980123, 0.1535324, 0.2802848, 0.9366375, 0.8109557,
0.6945285, 0.1012925, 0.6822955, 0.3757085, 0.7348635, 0.3026395,
0.9707223)), .Names = c("id", "foo"), class = "data.frame",
row.names = c(NA, -15L))
You can also use table. Take, for instance, the data.frame mtcars:
table(mtcars$cyl)
You will see that cyl takes three values: 4, 6, and 8, and that there are 7 cars with 6 cylinders. If you want to exclude values with fewer than 10 observations, you can drop the 6-cylinder cars like this:
mtcars[!mtcars$cyl %in% names(table(mtcars$cyl)[table(mtcars$cyl) < 10]), ]
This excludes the observations using only %in%, names, and table.

How to match missing IDs?

I have a large table with 50000 obs. The following mimics the structure:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B",NA,"D","E",NA,"G","H","I")
b <- c(11,2233,12,2,22,13,23,23,100)
c <- c(12,10,12,23,16,17,7,9,7)
df <- data.frame(ID ,a,b,c)
Where there are some missing values on the vector "a". However, I have some tables where the ID and the missing strings are included:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B","C","D","E","F","G","H","I")
key <- data.frame(ID,a)
Is there a way to include the missing strings from key into the column a using the ID?
Another option is to use data.table's fast binary join and update-by-reference capabilities:
library(data.table)
setkey(setDT(df), ID)[key, a := i.a]
df
# ID a b c
# 1: 1 A 11 12
# 2: 2 B 2233 10
# 3: 3 C 12 12
# 4: 4 D 2 23
# 5: 5 E 22 16
# 6: 6 F 13 17
# 7: 7 G 23 7
# 8: 8 H 23 9
# 9: 9 I 100 7
If you want to replace only the NAs (not all the joined cases), a slightly more complicated implementation would be:
setkey(setDT(key), ID)
setkey(setDT(df), ID)[is.na(a), a := key[.SD, a]]
You can just use match; however, I would recommend that both your datasets use characters instead of factors, to prevent headaches later on.
key$a <- as.character(key$a)
df$a <- as.character(df$a)
df$a[is.na(df$a)] <- key$a[match(df$ID[is.na(df$a)], key$ID)]
df
# ID a b c
# 1 1 A 11 12
# 2 2 B 2233 10
# 3 3 C 12 12
# 4 4 D 2 23
# 5 5 E 22 16
# 6 6 F 13 17
# 7 7 G 23 7
# 8 8 H 23 9
# 9 9 I 100 7
Of course, you could always stick with factors and factor the entire "ID" column and use the labels to replace the values in column "a"....
factor(df$ID, levels = key$ID, labels = key$a)
## [1] A B C D E F G H I
## Levels: A B C D E F G H I
Assign that to df$a and you're done....
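Spelled out, that assignment would be the following (a sketch; the result is a factor, so wrap it in as.character() if you prefer plain strings):
df$a <- factor(df$ID, levels = key$ID, labels = key$a)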
Named vectors make nice lookup tables:
lookup <- a
names(lookup) <- as.character(ID)
lookup is now a named vector; you can access each value with lookup[ID], e.g. lookup["2"] (make sure the index is a character, not a numeric).
## should give you a vector of a as required.
lookup[as.character(ID_from_big_table)]
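Applied to the data in this question, filling only the missing entries (a sketch):
lookup <- setNames(as.character(key$a), key$ID)  # named lookup vector
df$a <- as.character(df$a)                       # avoid factor pitfalls
df$a[is.na(df$a)] <- lookup[as.character(df$ID[is.na(df$a)])]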

using a lookup table in R with varying counts of data

Hi, I've been working on this issue all weekend. I'm trying to do a simple lookup, but my lookup table has a varying number of entries per lookup key.
Let's say I have two tables:
Table1 (sample of 3 rows; there are some extra columns of data, but they're irrelevant to my problem). col1 holds the GeneName:
col1   col2
HGGR   .554444
BRAC4  .333222
FAM34  .111222
My lookup table is a table of gene groups followed by their respective genes. The lookup table can have a varying number of columns depending on how many genes are in the group... This is a small example; the table often has 20-30 genes per group...
Table2 (example of 2 rows). col1 holds the GeneGroupName:
col1              col2   col3
CHR1_45000_46000  HGGR   BRAC4
CHR1_67000_70000  FAM34
What I want is another column in Table1 which shows the corresponding gene group!
FinalResultTable
col1 col2 col3
CHR1_45000_46000 HGGR .554444
CHR1_45000_46000 BRAC4 .333222
CHR1_67000_70000 FAM34 .111222
The code I have so far is:
finalresult <- cbind(gene_group[match(table1[, 1], gene_group[, 2]), 1], table1)
but of course that only works for genes found in the 2nd column of the gene group table! I need it to search through the whole table and return the row number....
Any help? Thanks in advance
David
One way to do it is to convert your Table 2 to long format, with a column for GeneGroupName and a single column for the member genes, and then use match.
(table1 <- data.frame(GeneName=sample(LETTERS[1:12]), col2=runif(12)))
# GeneName col2
# 1 F 0.6116285
# 2 L 0.5752088
# 3 J 0.7499011
# 4 D 0.9405068
# 5 A 0.9360968
# 6 K 0.6549850
# 7 I 0.7070163
# 8 E 0.3521952
# 9 C 0.4234293
# 10 G 0.7750203
# 11 B 0.1418680
# 12 H 0.6632382
(table2 <- data.frame(GeneGroupName=1:4, g1=LETTERS[1:4], g2=LETTERS[5:8],
                      g3=LETTERS[9:12]))
# GeneGroupName g1 g2 g3
# 1 1 A E I
# 2 2 B F J
# 3 3 C G K
# 4 4 D H L
(table2.long <- reshape(table2, direction='long', varying=list(-1), timevar='gene'))
# GeneGroupName gene g1 id
# 1.1 1 1 A 1
# 2.1 2 1 B 2
# 3.1 3 1 C 3
# 4.1 4 1 D 4
# 1.2 1 2 E 1
# 2.2 2 2 F 2
# 3.2 3 2 G 3
# 4.2 4 2 H 4
# 1.3 1 3 I 1
# 2.3 2 3 J 2
# 3.3 3 3 K 3
# 4.3 4 3 L 4
table1$GeneGroupName <- table2.long$GeneGroupName[match(table1$GeneName,
                                                        table2.long$g1)]
table1
# GeneName col2 GeneGroupName
# 1 F 0.6116285 2
# 2 L 0.5752088 4
# 3 J 0.7499011 2
# 4 D 0.9405068 4
# 5 A 0.9360968 1
# 6 K 0.6549850 3
# 7 I 0.7070163 1
# 8 E 0.3521952 1
# 9 C 0.4234293 3
# 10 G 0.7750203 3
# 11 B 0.1418680 2
# 12 H 0.6632382 4
One solution could be to use the data.table package.
Reproducing a minimal example:
library(data.table)
table1 <- data.table(col1 = c("HGGR", "BRAC4", "FAM34"), col2 = c(.55, .33, .11))
table2 <- data.table(col2 = c("HGGR", "FAM34"),
                     col1 = c("CHR1_45000_46000", "CHR1_67000_70000"),
                     col3 = c("BRAC4", NA))
# > table1
# col1 col2
# 1: BRAC4 0.33
# 2: FAM34 0.11
# 3: HGGR 0.55
# > table2
# col2 col1 col3
# 1: HGGR CHR1_45000_46000 BRAC4
# 2: FAM34 CHR1_67000_70000 NA
First, melt the second data.table to combine col2 and col3 into a single column:
table2=melt(table2, id=c("col1"), value.name="col2", na.rm=TRUE)
table2[,variable:=NULL]
Then join the two data.tables to get the wanted result:
setkey(table1, col1)
setkey(table2, col2)
table2[table1]
# col2 col1 col2.1
# BRAC4 CHR1_45000_46000 0.33
# FAM34 CHR1_67000_70000 0.11
# HGGR CHR1_45000_46000 0.55
Modifying @jbaums's sample data a bit (adding an NA in table2), here is one way with dplyr and tidyr.
library(dplyr)
library(tidyr)
table1 <- data.frame(GeneName = sample(LETTERS[1:12]), col2 = runif(12),
                     stringsAsFactors = FALSE)
table2 <- data.frame(GeneGroupName = 1:4, g1 = LETTERS[1:4], g2 = LETTERS[5:8],
                     g3 = c(LETTERS[9:11], NA), stringsAsFactors = FALSE)
table2 %>%
  gather(gene, whatever, -GeneGroupName) %>%
  left_join(., table1, by = c("whatever" = "GeneName")) %>%
  select(-gene, GeneGroupName, gene = whatever, value = col2)
# GeneGroupName gene value
#1 1 A 0.9926841
#2 2 B 0.3531973
#3 3 C 0.6547239
#4 4 D 0.4781180
#5 1 E 0.1293723
#6 2 F 0.6334933
#7 3 G 0.2132081
#8 4 H 0.5987610
#9 1 I 0.7317925
#10 2 J 0.9761707
#11 3 K 0.9240745
#12 4 <NA> NA

How to exclude a set of elements in R?

I have two data frames, A and B, with the same column names and content; B is a subset of A. I want to get A without B. I have tried different functions like setdiff, duplicated, which, and others. None of them worked for me; perhaps I didn't use them correctly. Any help is appreciated.
You could use merge, e.g.:
df1 <- data.frame(col1 = c('A','B','C','D','E'), col2 = 1:5, col3 = 11:15)
subset <- df1[c(2, 4), ]
subset$EXTRACOL <- 1  # use a column name that is not present among
                      # the original data.frame columns
merged <- merge(df1, subset, all = TRUE)
dfdifference <- merged[is.na(merged$EXTRACOL), ]
dfdifference$EXTRACOL <- NULL
-----------------------------------------
> df1:
col1 col2 col3
1 A 1 11
2 B 2 12
3 C 3 13
4 D 4 14
5 E 5 15
> subset:
col1 col2 col3
2 B 2 12
4 D 4 14
> dfdifference:
col1 col2 col3
1 A 1 11
3 C 3 13
5 E 5 15
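For reference, if dplyr is an option, an anti-join does this directly — a sketch, assuming the two frames share all column names:
library(dplyr)
dfdifference <- anti_join(df1, subset)  # keeps rows of df1 with no match in subset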
