Data frame manoeuvre [duplicate] - r

Suppose I have the following dataframe:
Categories Variable
1 a 11
2 b 21
3 c 34
4 d 45
5 e 52
6 f 65
7 g 76
8 a 13
9 b 24
I'd like to turn it into a new dataframe like the following:
Categories Variable
1 a 11
2 b 21
3 c 34
4 d+e 97
5 f 65
6 g 76
7 a 13
8 b 24
How can I do it? (The real data frame is much larger, of course, but I want to sum the values of categories d and e and group them into a new category, say 'h'.)
Many thanks!

This is a good question but unfortunately off-topic here, so I'll answer it until it gets migrated.
I'm assuming Categories is of class factor, so you'll need to re-level it properly (assuming your data is called df):
levels(df$Categories)[levels(df$Categories) %in% c("d", "e")] <- "h"
Next, I'll use the data.table package, as you have a large data set, and its development version (v >= 1.9.5) has a convenient function called rleid (install it from GitHub):
library(data.table) ## v >= 1.9.5
setDT(df)[, .(Variable = sum(Variable)), by = .(indx = rleid(Categories), Categories)]
# indx Categories Variable
# 1: 1 a 11
# 2: 2 b 21
# 3: 3 c 34
# 4: 4 h 97
# 5: 5 f 65
# 6: 6 g 76
# 7: 7 a 13
# 8: 8 b 24
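For comparison, a rough dplyr sketch of the same idea (not part of the original answer; assuming dplyr >= 1.0): relabel d and e first, then build a run-length index with cumsum() so separate runs of the same category stay apart.
library(dplyr)
df %>%
  mutate(Categories = ifelse(Categories %in% c("d", "e"), "h", as.character(Categories))) %>%
  group_by(indx = cumsum(Categories != lag(Categories, default = first(Categories))), Categories) %>%
  summarise(Variable = sum(Variable), .groups = "drop")  # one row per run of each category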

You can try this:
# plyr package provides rbind.fill() function for row binding
library(plyr)
# Assuming you have a rows.csv file containing the data, read it into a data frame
data<-read.csv("rows.csv",stringsAsFactors=FALSE)
# Find the lowest index of d or e (whichever comes first)
index<-min(match("d",data$Var1.nominal.), match("e",data$Var1.nominal.))
# Returns all rows containing d and e in Var1(nominal) column
tempData<-data[data$Var1.nominal. %in% c("d","e"),]
# Remove all the rows containing d and e from original data frame
data<-data[!data$Var1.nominal. %in% c("d","e"),]
# Reorder row index numbers in data
rownames(data)<-NULL
# Combine rows containing d and e in Var1(nominal)column, and sum up the column Var2(numeric)
tempData<-data.frame(Var1.nominal.="d+e",Var2.numeric.=sum(tempData[,2]))
# Combine original data and tempData frame with use of index
data<-rbind.fill(data[1:(index-1),],tempData,data[index:length(data[,1]),])
# Rename "d+e" to "h"
data[index,1]="h"
# Getting rid of the tempData data frame
rm(tempData)
Output:
> data
Var1.nominal. Var2.numeric.
1 a 11
2 b 21
3 c 34
4 h 97
5 f 65
6 g 76
7 a 13
8 b 24


Gathering columns from wide to long by id [duplicate]

I've got a data frame like this:
set.seed(100)
drugs <- data.frame(id = 1:5,
                    drug_1 = letters[1:5], drug_dos_1 = sample(100, 5),
                    drug_2 = letters[3:7], drug_dos_2 = sample(100, 5))
id drug_1 drug_dos_1 drug_2 drug_dos_2
1 a 31 c 49
2 b 26 d 81
3 c 55 e 37
4 d 6 f 54
5 e 45 g 17
I'd like to transform this messy table into a tidy table with all drugs of an id in one column and the corresponding drug dosages in one column. The table should look like this in the end:
id drug dosage
1 a 31
1 c 49
2 b 26
2 d 81
etc
I guess this could be achieved by using a reshaping function that transforms my data from wide to long format, but I didn't manage to get it working.
One option is melt from data.table, which can take multiple patterns in the measure argument:
library(data.table)
melt(setDT(drugs), measure = patterns('^drug_\\d+$', 'dos'),
value.name = c('drug', 'dosage'))[, variable := NULL][order(id)]
# id drug dosage
#1: 1 a 31
#2: 1 c 49
#3: 2 b 26
#4: 2 d 81
#5: 3 c 55
#6: 3 e 37
#7: 4 d 6
#8: 4 f 54
#9: 5 e 45
#10: 5 g 17
Here, 'drug' is common to all the column names, so we need a pattern that is unique to the drug columns. One way is to anchor at the start of the string (^), match the 'drug' substring, then an underscore (_) and one or more digits (\\d+) at the end ($) of the string. For the dosage columns, matching the 'dos' substring is enough.
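An alternative sketch (not from the original answers; assuming tidyr >= 1.0.0): pivot_longer can do the same reshape with a names_pattern regex, and because it keeps drug and dosage as separately typed columns it avoids the attribute warning shown further below.
library(dplyr)
library(tidyr)
drugs %>%
  pivot_longer(-id,
               names_to = c(".value", "set"),                 # first regex group becomes the output column name
               names_pattern = "(drug_dos|drug)_(\\d+)") %>%
  select(id, drug, dosage = drug_dos)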
library(dplyr)
library(tidyr)  # gather() and spread() come from tidyr
drugs %>% gather(key, val, -id) %>%
  mutate(key = gsub('_\\d', '', key)) %>%   # drop the trailing _1 / _2
  mutate(key = gsub('drug_', '', key)) %>%  # drop the leading drug_ from the dosage columns
  group_by(key) %>% mutate(row = row_number()) %>%
  spread(key, val) %>%
  select(id, drug, dos)
# A tibble: 10 x 3
id drug dos
<int> <chr> <chr>
1 1 a 31
2 1 c 49
3 2 b 26
4 2 d 81
5 3 c 55
6 3 e 37
7 4 d 6
8 4 f 54
9 5 e 45
10 5 g 17
Warning message:
attributes are not identical across measure variables;
they will be dropped
# This warning is generated because we merged drug (character) and dose (numeric) into one column (val)

How to match missing IDs?

I have a large table with 50000 obs. The following mimics its structure:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B",NA,"D","E",NA,"G","H","I")
b <- c(11,2233,12,2,22,13,23,23,100)
c <- c(12,10,12,23,16,17,7,9,7)
df <- data.frame(ID, a, b, c)
Where there are some missing values on the vector "a". However, I have some tables where the ID and the missing strings are included:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B","C","D","E","F","G","H","I")
key <- data.frame(ID,a)
Is there a way to include the missing strings from key into the column a using the ID?
Another option is to use data.table's fast binary join and update-by-reference capabilities:
library(data.table)
setkey(setDT(df), ID)[key, a := i.a]
df
# ID a b c
# 1: 1 A 11 12
# 2: 2 B 2233 10
# 3: 3 C 12 12
# 4: 4 D 2 23
# 5: 5 E 22 16
# 6: 6 F 13 17
# 7: 7 G 23 7
# 8: 8 H 23 9
# 9: 9 I 100 7
If you want to replace only the NAs (rather than all the joined cases), a slightly more complicated implementation would be:
setkey(setDT(key), ID)
setkey(setDT(df), ID)[is.na(a), a := key[.SD, a]]
You can just use match; however, I would recommend that both of your datasets use characters instead of factors, to prevent headaches later on.
key$a <- as.character(key$a)
df$a <- as.character(df$a)
df$a[is.na(df$a)] <- key$a[match(df$ID[is.na(df$a)], key$ID)]
df
# ID a b c
# 1 1 A 11 12
# 2 2 B 2233 10
# 3 3 C 12 12
# 4 4 D 2 23
# 5 5 E 22 16
# 6 6 F 13 17
# 7 7 G 23 7
# 8 8 H 23 9
# 9 9 I 100 7
Of course, you could always stick with factors, factor the entire "ID" column, and use the labels to replace the values in column "a":
factor(df$ID, levels = key$ID, labels = key$a)
## [1] A B C D E F G H I
## Levels: A B C D E F G H I
Assign that to df$a and you're done....
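Spelled out, that last step is simply the assignment below (not in the original answer, just completing the step it describes). Note that it rebuilds the whole column from the key rather than filling only the NAs.
df$a <- factor(df$ID, levels = key$ID, labels = key$a)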
Named vectors make nice lookup tables:
lookup <- a
names(lookup) <- as.character(ID)
lookup is now a named vector; you can access each value with lookup[ID], e.g. lookup["2"] (make sure the index is a character, not numeric).
## should give you a vector of a as required.
lookup[as.character(ID_from_big_table)]
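To tie this back to the example data, a minimal sketch (not from the original answer) that builds the lookup from key and fills only the missing values:
lookup <- setNames(as.character(key$a), key$ID)
df$a <- as.character(df$a)                       # avoid factor-level surprises
miss <- is.na(df$a)
df$a[miss] <- lookup[as.character(df$ID[miss])]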

Filtering a Dataset by another Dataset in R

The task I am trying to accomplish is essentially filtering one dataset by the entries in the "id" column of another dataset. The datasets I am working with are quite large, with tens of thousands of entries and 30 or so variables, so I have made toy datasets to help explain what I want to do.
The first dataset contains a list of entries, and each entry has its own unique accession number (this is the id).
Data1 = data.frame(accession_number = c('a','b','c','d','e','f'), values =c('1','3','4','2','3','12'))
>Data1
accession_number values
1 a 1
2 b 3
3 c 4
4 d 2
5 e 3
6 f 12
I am only interested in the entries that have the accession numbers 'c', 'd', and 'e' (in reality my list is around 100 unique accession numbers). Next, I created a data frame with only the unique accession numbers and no other values.
>SubsetData1
accession_number
1 c
2 d
3 e
The second dataset, which I am looking to filter, contains multiple entries, some of which have the same accession number.
>Data2
accession_number values Intensity col4 col6
1 a 1 -0.0251304 a -0.4816370
2 a 2 -0.4308735 b -1.0335971
3 c 3 -1.9001321 c 0.6416735
4 c 4 0.1163934 d -0.4489048
5 c 5 0.7586820 e 0.5408650
6 b 6 0.4294415 f 0.6828412
7 b 7 -0.8045201 g 0.6677730
8 b 8 -0.9898947 h 0.3948412
9 c 9 -0.6004642 i -0.3323932
10 c 10 1.1367578 j 0.9151915
11 c 11 0.7084980 k -0.3424039
12 c 12 -0.9618102 l 0.2386307
13 c 13 0.2693441 m -1.3861064
14 d 14 1.6059971 n 1.3801924
15 e 15 2.4166472 o -1.1806929
16 e 16 -0.7834619 p 0.1880451
17 e 17 1.3856535 q -0.7826357
18 f 18 -0.6660976 r 0.6159731
19 f 19 0.2089186 s -0.8222399
20 f 20 -1.5809582 t 1.5567113
21 f 21 0.3610700 u 0.3264431
22 f 22 1.2923324 v 0.9636267
What I'm looking to do is compare the subsetted list from the first dataset (SubsetData1) with the second dataset (Data2) to create a filtered dataset that only contains the entries with the accession numbers defined in the subsetted list. The filtered dataset should look something like this:
accession_number values Intensity col4 col6
9 c 9 -0.6004642 i -0.3323932
10 c 10 1.1367578 j 0.9151915
11 c 11 0.7084980 k -0.3424039
12 c 12 -0.9618102 l 0.2386307
13 c 13 0.2693441 m -1.3861064
14 d 14 1.6059971 n 1.3801924
15 e 15 2.4166472 o -1.1806929
16 e 16 -0.7834619 p 0.1880451
17 e 17 1.3856535 q -0.7826357
I don't know if I need to start making loops in order to tackle this problem, or if there is a simple R command that would help me accomplish this task. Any help is much appreciated.
Thank You
Try this
WantedData <- Data2[Data2$accession_number %in% SubsetData1$accession_number, ]
You can also use inner_join from the dplyr package:
dat <- inner_join(Data2, SubsetData1)
The subset function is designed for basic subsetting:
subset(Data2,accession_number %in% SubsetData1$accession_number)
Alternately, here you could merge:
merge(Data2,SubsetData1)
The other solutions seem fine, but I like the readability of dplyr, so here's a dplyr solution.
library(dplyr)
new_dataset <- Data2 %>%
filter(accession_number %in% SubsetData1$accession_number)
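A related option (not in the original answers) is semi_join, dplyr's filtering join: it keeps the rows of Data2 that have a match in SubsetData1 without adding any columns from it.
library(dplyr)
filtered <- semi_join(Data2, SubsetData1, by = "accession_number")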

Expand Records to create Edges for igraph

I have a dataset that has multiple data points I want to map. iGraph uses pairwise (1-1) relationships, though, so I'm looking for a way to break one long record into many 1-1 records. For example:
test <- data.frame(
  drug1 = c("A","B","C","D","E","F","G","H","I","J","K"),
  drug2 = c("P","O","R","T","L","A","N","D","R","A","D"),
  drug3 = c("B","O","R","I","S","B","E","C","K","E","R"),
  age   = c(15,20,35,1,35,58,51,21,54,80,75))
Which gives this output
drug1 drug2 drug3 age
1 A P B 15
2 B O O 20
3 C R R 35
4 D T I 1
5 E L S 35
6 F A B 58
7 G N E 51
8 H D C 21
9 I R K 54
10 J A E 80
11 K D R 75
I'd like to make a new table with the drug1-drug2 pairs and then stack the drug2-drug3 pairs into the same columns. So it would look like this:
drug1 drug2 age
1 A P 15
2 P B 15
3 C R 20
4 R R 20
5 E L 35
drug2 is held in the drug1 spot and drug3 is moved to the drug2 spot. I realize I can do this by creating multiple smaller steps, but I was wondering if anyone knew of a way to loop this process. I have up to 11 fields.
Here are the smaller steps.
a <- test[,c("drug1","drug2","age")]
b <- test[,c("drug2","drug3","age")]
names(b) <- c("drug1","drug2","age")
test2 <- rbind(a,b)
drug1 drug2 age
1 A P 15
2 B O 20
3 C R 35
4 D T 1
5 E L 35
6 F A 58
7 G N 51
8 H D 21
9 I R 54
10 J A 80
11 K D 75
12 P B 15
13 O O 20
14 R R 35
15 T I 1
16 L S 35
17 A B 58
18 N E 51
19 D C 21
20 R K 54
21 A E 80
22 D R 75
So if you have many fields, here's a helper function which can pull the data down into pairs:
pulldown <- function(data, cols = 1:(min(attr) - 1),
                     attr = ncol(data),
                     newnames = names(data)[c(cols[1:2], attr)]) {
  # allow column names as well as numeric indices
  if (is.character(attr)) attr <- match(attr, names(data))
  if (is.character(cols)) cols <- match(cols, names(data))
  # embed(cols, 2) lists each pair of adjacent columns; stack every pair
  # (plus the repeated attr columns) under a common set of names
  do.call(rbind, lapply(unname(data.frame(t(embed(cols, 2)))), function(x) {
    `colnames<-`(data[, c(sort(x), attr)], newnames)
  }))
}
You can run it with your data with
pulldown(test)
It has a parameter called attr where you can specify the columns (indices or names) you would like repeated on every row (here it defaults to the last column). The cols parameter is a vector of all the columns that you would like to turn into pairs (the default is everything from the beginning up to one before the first attr column). You can also specify a vector of newnames for the columns as they come out.
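For instance (not in the original answer, assuming the helper above), the same call with the arguments spelled out explicitly:
pulldown(test, cols = c("drug1", "drug2", "drug3"), attr = "age")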
With three columns your method is pretty simple, but this might be a better choice for 11 columns.
Slightly more compact, and a one-liner, would be:
test2 <- rbind(test[c("drug1", "drug2", "age")],
               setNames(test[c("drug2", "drug3", "age")], c("drug1", "drug2", "age")))
The setNames function can be useful when column names are missing or need to be coerced to something else.
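If you would rather avoid a helper function, a rough sketch (not from the original answers) of looping the same rbind idea over any number of drug_* columns:
drug_cols <- grep("^drug", names(test), value = TRUE)
pairs <- Map(c, drug_cols[-length(drug_cols)], drug_cols[-1])   # adjacent column pairs
test2 <- do.call(rbind, lapply(pairs, function(p)
  setNames(test[c(p, "age")], c("drug1", "drug2", "age"))))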

Extract rows from data frame based on multiple identifiers in another data frame

I would like to extract a selection of rows from a data frame based on multiple identifying variables contained in another data frame. Consider the following illustrative data set:
df <- data.frame(id=c(1,2,2,3,4,4,4,4,5), ref=c("A","B","C","D","E","F","F","G","H"), amount=c(10,15,20,25,30,35,-35,40,45))
required <- data.frame(id=c(2,3,4,4), ref=c("B","D","E","F"))
I would like the output in a data frame with id, ref and amount as follows:
id ref amount
2 B 15
3 D 25
4 E 30
4 F 35
4 F -35
Note in particular that id 4 and ref F have two matches from the df with amounts 35 and -35.
You want to merge:
merge(df, required)
## id ref amount
## 1 2 B 15
## 2 3 D 25
## 3 4 E 30
## 4 4 F 35
## 5 4 F -35
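An equivalent sketch with dplyr (not part of the original answer); like merge(), the join matches on both shared columns, id and ref:
library(dplyr)
inner_join(df, required, by = c("id", "ref"))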
