Suppose I have the following dataframe:
Categories Variable
1 a 11
2 b 21
3 c 34
4 d 45
5 e 52
6 f 65
7 g 76
8 a 13
9 b 24
I'd like to turn it into a new dataframe like the following:
Categories Variable
1 a 11
2 b 21
3 c 34
4 d+e 97
5 f 65
6 g 76
7 a 13
8 b 24
How can I do it? (Of course, the real data frame is much larger, but I want to sum the values of all rows in categories d and e and group them into a new category, say 'h'.)
Many thanks!
This is a good question but unfortunately off-topic here, so I'll answer it until it gets migrated.
I'm assuming Categories is of class factor, so you'll need to re-level it properly (assuming your data is called df):
levels(df$Categories)[levels(df$Categories) %in% c("d", "e")] <- "h"
Next, I'll use the data.table package, as you have a large data set and its development version (v >= 1.9.5) has a convenient function called rleid (install it from GitHub):
library(data.table) ## v >= 1.9.5
setDT(df)[, .(Variable = sum(Variable)), by = .(indx = rleid(Categories), Categories)]
# indx Categories Variable
# 1: 1 a 11
# 2: 2 b 21
# 3: 3 c 34
# 4: 4 h 97
# 5: 5 f 65
# 6: 6 g 76
# 7: 7 a 13
# 8: 8 b 24
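If installing the development version isn't convenient, a rough base-R sketch of the same run-length-grouping idea might look like this (assuming the re-leveling step above has already been applied to df$Categories):
r    <- rle(as.character(df$Categories))        # run-length encoding of the categories
indx <- rep(seq_along(r$lengths), r$lengths)    # run id, the same idea as rleid()
data.frame(Categories = r$values,
           Variable   = tapply(df$Variable, indx, sum))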
You can try this:
# The plyr package provides rbind.fill() for row binding
library(plyr)
# Assuming you have a rows.csv file containing the data, read it into a data frame
data <- read.csv("rows.csv", stringsAsFactors = FALSE)
# Find the lowest index of d or e (whichever comes first)
index <- min(match("d", data$Var1.nominal.), match("e", data$Var1.nominal.))
# Keep all rows containing d or e in the Var1(nominal) column
tempData <- data[data$Var1.nominal. %in% c("d", "e"), ]
# Remove all the rows containing d or e from the original data frame
data <- data[!data$Var1.nominal. %in% c("d", "e"), ]
# Reset the row index numbers in data
rownames(data) <- NULL
# Combine the rows containing d and e in the Var1(nominal) column, summing up the Var2(numeric) column
tempData <- data.frame(Var1.nominal. = "d+e", Var2.numeric. = sum(tempData[, 2]))
# Combine the original data and tempData with the help of index
data <- rbind.fill(data[1:(index - 1), ], tempData, data[index:length(data[, 1]), ])
# Rename "d+e" to "h"
data[index, 1] <- "h"
# Get rid of the tempData data frame
rm(tempData)
Output:
> data
Var1.nominal. Var2.numeric.
1 a 11
2 b 21
3 c 34
4 h 97
5 f 65
6 g 76
7 a 13
8 b 24
I've got a data frame like this:
set.seed(100)
drugs <- data.frame(id = 1:5,
                    drug_1 = letters[1:5], drug_dos_1 = sample(100, 5),
                    drug_2 = letters[3:7], drug_dos_2 = sample(100, 5))
id drug_1 drug_dos_1 drug_2 drug_dos_2
1 a 31 c 49
2 b 26 d 81
3 c 55 e 37
4 d 6 f 54
5 e 45 g 17
I'd like to transform this messy table into a tidy table with all drugs of an id in one column and the corresponding drug dosages in one column. The table should look like this in the end:
id drug dosage
1 a 31
1 c 49
2 b 26
2 d 81
etc
I guess this could be achieved with a reshaping function that transforms my data from wide to long format, but I didn't manage it.
One option is melt from data.table, which can take multiple patterns in the measure argument:
library(data.table)
melt(setDT(drugs), measure = patterns('^drug_\\d+$', 'dos'),
     value.name = c('drug', 'dosage'))[, variable := NULL][order(id)]
# id drug dosage
#1: 1 a 31
#2: 1 c 49
#3: 2 b 26
#4: 2 d 81
#5: 3 c 55
#6: 3 e 37
#7: 4 d 6
#8: 4 f 54
#9: 5 e 45
#10: 5 g 17
Here, 'drug' is common to all the columns, so we need a pattern that is unique to the drug-name columns. One way is to anchor at the start of the string (^), match the 'drug' substring, then an underscore (_) and one or more digits (\\d+) at the end ($) of the string. For the dosage columns, the 'dos' substring alone is enough to match the column names that contain it.
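Purely to illustrate which columns each of those two patterns picks up (not part of the solution itself), you can grep the column names directly:
grep('^drug_\\d+$', names(drugs), value = TRUE)  # "drug_1" "drug_2"
grep('dos', names(drugs), value = TRUE)          # "drug_dos_1" "drug_dos_2"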
library(dplyr)
library(tidyr)  # gather() and spread() come from tidyr
drugs %>%
  gather(key, val, -id) %>%
  mutate(key = gsub('_\\d', '', key)) %>%   # drop the trailing _1 / _2
  mutate(key = gsub('drug_', '', key)) %>%  # strip the leading drug_ from the dosage columns
  group_by(key) %>%
  mutate(row = row_number()) %>%
  spread(key, val) %>%
  select(id, drug, dos, -row)
# A tibble: 10 x 3
id drug dos
<int> <chr> <chr>
1 1 a 31
2 1 c 49
3 2 b 26
4 2 d 81
5 3 c 55
6 3 e 37
7 4 d 6
8 4 f 54
9 5 e 45
10 5 g 17
Warning message:
attributes are not identical across measure variables;
they will be dropped
# This warning is generated because we merged drug (character) and dos (numeric) into one column (val)
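For completeness, a sketch of the same reshape with the newer tidyr interface (assuming tidyr >= 1.0.0 is available; this is an alternative, not the approach above). The ".value" sentinel sends the drug names and dosages to separate columns in one step:
library(tidyr)
library(dplyr)
drugs %>%
  pivot_longer(cols = -id,
               names_to = c(".value", "set"),
               names_pattern = "^(.*)_(\\d+)$") %>%
  select(id, drug, dosage = drug_dos)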
I have a large table with 50000 obs. The following mimics its structure:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B",NA,"D","E",NA,"G","H","I")
b <- c(11,2233,12,2,22,13,23,23,100)
c <- c(12,10,12,23,16,17,7,9,7)
df <- data.frame(ID, a, b, c)
There are some missing values in the column "a". However, I have some tables where the ID and the missing strings are included:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B","C","D","E","F","G","H","I")
key <- data.frame(ID,a)
Is there a way to fill in the missing strings from key into column a by matching on ID?
Another option is to use data.table's fast binary join and update-by-reference capabilities:
library(data.table)
setkey(setDT(df), ID)[key, a := i.a]
df
# ID a b c
# 1: 1 A 11 12
# 2: 2 B 2233 10
# 3: 3 C 12 12
# 4: 4 D 2 23
# 5: 5 E 22 16
# 6: 6 F 13 17
# 7: 7 G 23 7
# 8: 8 H 23 9
# 9: 9 I 100 7
If you want to replace only the NAs (not all the joined cases), a slightly more complicated implementation would be:
setkey(setDT(key), ID)
setkey(setDT(df), ID)[is.na(a), a := key[.SD, a]]
You can just use match; however, I would recommend that both of your datasets use characters instead of factors, to prevent headaches later on.
key$a <- as.character(key$a)
df$a <- as.character(df$a)
df$a[is.na(df$a)] <- key$a[match(df$ID[is.na(df$a)], key$ID)]
df
# ID a b c
# 1 1 A 11 12
# 2 2 B 2233 10
# 3 3 C 12 12
# 4 4 D 2 23
# 5 5 E 22 16
# 6 6 F 13 17
# 7 7 G 23 7
# 8 8 H 23 9
# 9 9 I 100 7
Of course, you could always stick with factors and factor the entire "ID" column and use the labels to replace the values in column "a"....
factor(df$ID, levels = key$ID, labels = key$a)
## [1] A B C D E F G H I
## Levels: A B C D E F G H I
Assign that to df$a and you're done....
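That is, spelling the assignment out:
df$a <- factor(df$ID, levels = key$ID, labels = key$a)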
Named vectors make nice lookup tables:
lookup <- a
names(lookup) <- as.character(ID)
lookup is now a named vector; you can access each value by lookup[ID], e.g. lookup["2"] (make sure the index is a character, not a numeric).
## should give you a vector of a as required.
lookup[as.character(ID_from_big_table)]
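A concrete sketch of using such a lookup table to fill only the NAs in the df from the question (building the named vector from key here, just for illustration):
lookup <- setNames(as.character(key$a), key$ID)
df$a <- as.character(df$a)                          # avoid factor-level surprises
df$a[is.na(df$a)] <- lookup[as.character(df$ID[is.na(df$a)])]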
The task I am trying to accomplish is essentially filtering one dataset by the entries in the "id" column of another dataset. The data sets I am working with are quite large, having tens of thousands of entries and 30 or so variables. I have made toy datasets to help explain what I want to do.
The first dataset contains a list of entries, and each entry has its own unique accession number (this is the id).
Data1 <- data.frame(accession_number = c('a','b','c','d','e','f'), values = c('1','3','4','2','3','12'))
>Data1
accession_number values
1 a 1
2 b 3
3 c 4
4 d 2
5 e 3
6 f 12
I am only interested in the entries that have accession number 'c', 'd', or 'e'. (In reality, though, my list is around 100 unique accession numbers.) Next, I created a data frame with only the unique accession numbers and no other values.
>SubsetData1
accession_number
1 c
2 d
3 e
The second data set, which I am looking to filter, contains multiple entries, some of which have the same accession number.
>Data2
accession_number values Intensity col4 col6
1 a 1 -0.0251304 a -0.4816370
2 a 2 -0.4308735 b -1.0335971
3 c 3 -1.9001321 c 0.6416735
4 c 4 0.1163934 d -0.4489048
5 c 5 0.7586820 e 0.5408650
6 b 6 0.4294415 f 0.6828412
7 b 7 -0.8045201 g 0.6677730
8 b 8 -0.9898947 h 0.3948412
9 c 9 -0.6004642 i -0.3323932
10 c 10 1.1367578 j 0.9151915
11 c 11 0.7084980 k -0.3424039
12 c 12 -0.9618102 l 0.2386307
13 c 13 0.2693441 m -1.3861064
14 d 14 1.6059971 n 1.3801924
15 e 15 2.4166472 o -1.1806929
16 e 16 -0.7834619 p 0.1880451
17 e 17 1.3856535 q -0.7826357
18 f 18 -0.6660976 r 0.6159731
19 f 19 0.2089186 s -0.8222399
20 f 20 -1.5809582 t 1.5567113
21 f 21 0.3610700 u 0.3264431
22 f 22 1.2923324 v 0.9636267
What I'm looking to do is compare the subsetted list from the first data set (SubsetData1) with the second dataset (Data2) to create a filtered dataset that contains only the entries whose accession numbers appear in the subsetted list. The filtered dataset should look something like this:
accession_number values Intensity col4 col6
9 c 9 -0.6004642 i -0.3323932
10 c 10 1.1367578 j 0.9151915
11 c 11 0.7084980 k -0.3424039
12 c 12 -0.9618102 l 0.2386307
13 c 13 0.2693441 m -1.3861064
14 d 14 1.6059971 n 1.3801924
15 e 15 2.4166472 o -1.1806929
16 e 16 -0.7834619 p 0.1880451
17 e 17 1.3856535 q -0.7826357
I don't know if I need to start writing loops to tackle this problem, or if there is a simple R command that would help me accomplish this task. Any help is much appreciated.
Thank you!
Try this:
WantedData <- Data2[Data2$accession_number %in% SubsetData1$accession_number, ]
You can also use the inner_join function from the dplyr package:
dat <- inner_join(Data2, SubsetData1)
The subset function is designed for basic subsetting:
subset(Data2,accession_number %in% SubsetData1$accession_number)
Alternately, here you could merge:
merge(Data2,SubsetData1)
The other solutions seem fine, but I like the readability of dplyr, so here's a dplyr solution.
library(dplyr)
new_dataset <- Data2 %>%
  filter(accession_number %in% SubsetData1$accession_number)
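Another dplyr option worth knowing about (a sketch on the same toy data) is semi_join, which keeps the rows of the first table that have a match in the second:
library(dplyr)
new_dataset <- semi_join(Data2, SubsetData1, by = "accession_number")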
I have a dataset that has multiple data points I want to map. igraph uses 1-to-1 relationships, though, so I'm looking for a way to break one long record into many 1-to-1 records. For example:
test <- data.frame(
  drug1 = c("A","B","C","D","E","F","G","H","I","J","K"),
  drug2 = c("P","O","R","T","L","A","N","D","R","A","D"),
  drug3 = c("B","O","R","I","S","B","E","C","K","E","R"),
  age   = c(15,20,35,1,35,58,51,21,54,80,75))
This gives the following output:
drug1 drug2 drug3 age
1 A P B 15
2 B O O 20
3 C R R 35
4 D T I 1
5 E L S 35
6 F A B 58
7 G N E 51
8 H D C 21
9 I R K 54
10 J A E 80
11 K D R 75
I'd like to make a new table with the drug1-drug2 pairs and then stack the drug2-drug3 pairs underneath them in the same columns, so it would look like this:
drug1 drug2 age
1 A P 15
2 P B 15
3 C R 20
4 R R 20
5 E L 35
drug2 is held in the drug1 spot and drug3 is moved to the drug2 spot. I realize I can do this in multiple smaller steps, but I was wondering if anyone knew of a way to loop this process. I have up to 11 fields.
Here are the smaller steps.
a <- test[,c("drug1","drug2","age")]
b <- test[,c("drug2","drug3","age")]
names(b) <- c("drug1","drug2","age")
test2 <- rbind(a,b)
drug1 drug2 age
1 A P 15
2 B O 20
3 C R 35
4 D T 1
5 E L 35
6 F A 58
7 G N 51
8 H D 21
9 I R 54
10 J A 80
11 K D 75
12 P B 15
13 O O 20
14 R R 35
15 T I 1
16 L S 35
17 A B 58
18 N E 51
19 D C 21
20 R K 54
21 A E 80
22 D R 75
So if you have many fields, here's a helper function which can pull down the data into pairs.
pulldown <- function(data, cols = 1:(min(attr) - 1),
                     attr = ncol(data),
                     newnames = names(data)[c(cols[1:2], attr)]) {
  if (is.character(attr)) attr <- match(attr, names(data))
  if (is.character(cols)) cols <- match(cols, names(data))
  do.call(rbind, lapply(unname(data.frame(t(embed(cols, 2)))), function(x) {
    `colnames<-`(data[, c(sort(x), attr)], newnames)
  }))
}
You can run it with your data with
pulldown(test)
It has a parameter called attr where you can specify the columns (by index or name) that you would like repeated on every row (here it defaults to the last column). The cols parameter is a vector of all the columns that you would like to turn into pairs (the default is everything from the first column up to one before the first attr column). You can also specify a vector of newnames for the output columns.
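For instance, the same call with the defaults spelled out explicitly (purely illustrative):
pulldown(test,
         cols     = c("drug1", "drug2", "drug3"),  # columns to turn into pairs
         attr     = "age",                         # column repeated on every row
         newnames = c("drug1", "drug2", "age"))    # names of the output columns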
With three columns your method is pretty simple, this might be a better choice for 11 columns.
Slightly more compact and a one-liner would be:
test2 <- rbind( test[c("drug1","drug2","age")],
setNames(test[c("drug3", "drug2", "age")], c("drug1", "drug2", "age"))
)
The setNames function can be useful when column names are missing or need to be coerced to something else.
I would like to extract a selection of rows from a data frame based on multiple identifying variables contained in another data frame. Consider the following illustrative data set:
df <- data.frame(id=c(1,2,2,3,4,4,4,4,5), ref=c("A","B","C","D","E","F","F","G","H"), amount=c(10,15,20,25,30,35,-35,40,45))
required <- data.frame(id=c(2,3,4,4), ref=c("B","D","E","F"))
I would like the output in a data frame with id, ref and amount as follows:
id ref amount
2 B 15
3 D 25
4 E 30
4 F 35
4 F -35
Note in particular that id 4 and ref F have two matches from the df with amounts 35 and -35.
You want to merge:
merge(df, required)
## id ref amount
## 1 2 B 15
## 2 3 D 25
## 3 4 E 30
## 4 4 F 35
## 5 4 F -35
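merge joins on all columns the two data frames have in common (here, id and ref). If you prefer to make that explicit, or if the key columns were named differently in the two data frames, the by arguments can be spelled out (a sketch):
merge(df, required, by = c("id", "ref"))
# with differing key names you would use by.x = / by.y = instead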