How can I arrange a data.frame according to the factor levels? - r

I have a data.frame df and want to arrange (sort) it by the index column according to the factor levels.
The result should look like "the wished data.frame" below. Can anyone help? Thanks!
#create data frame
df<-data.frame(index=c("b","a","e"),amount=c(10,76,60))
df$index<-factor(df$index,levels=c("a","b","e"))
# current df
index amount
1 b 10
2 a 76
3 e 60
# the wished data.frame
index amount
1 a 76
2 b 10
3 e 60

Like this? (arrange comes from dplyr.)
library(dplyr)
arrange(df, match(df$index, levels(df$index)))
index amount
1 a 76
2 b 10
3 e 60
Data
df<-data.frame(index=c("b","a","e"),amount=c(10,76,60))
df$index<-factor(df$index,levels=c("a","b","e"))

You can use order, which sorts a factor by the position of its levels:
df[order(df$index), ]
# index amount
#2 a 76
#1 b 10
#3 e 60
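To make it visible that order follows the level order rather than alphabetical order, here is a minimal sketch. It reuses the question's data but assumes a deliberately non-alphabetical level order (b, e, a), which is not in the original question.

```r
df <- data.frame(index = c("b", "a", "e"), amount = c(10, 76, 60))
# Assumed level order, for illustration only: b < e < a
df$index <- factor(df$index, levels = c("b", "e", "a"))

# order() ranks a factor by its integer codes, i.e. by level position
sorted <- df[order(df$index), ]
rownames(sorted) <- NULL
sorted
```

With these assumed levels the rows come out as b, e, a, so changing the levels argument is enough to change the sort order.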

Related

How to find first occurrence of a vector of numeric elements within a data frame column?

I have a data frame (min_set_obs) which contains two columns: the first containing numeric values, called Treatment, and the second an id column called seq:
min_set_obs
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
Let's say I have a vector of numeric values, called key:
key
[1] 1 1 1 2 2 3
I.e. a vector of three 1s, two 2s, and one 3.
How would I go about identifying which rows from my min_set_obs data frame contain the first occurrence of values from the key vector?
I'd like my output to look like this:
Treatment seq
1 29
1 23
3 60
1 6
2 41
2 44
I.e. the sixth row from min_set_obs was 'extra' (it was the fourth 1 when there should only be three 1s), so it would be removed.
I'm familiar with the %in% operator, but I don't think it can tell me the position of the first occurrence of the key vector in the first column of the min_set_obs data frame.
Thanks
Here is an option with base R: we split 'min_set_obs' by 'Treatment' into a list, take the head of each list element using the corresponding frequency from 'key', and rbind the list elements into a single data.frame. Note that the result is grouped by 'Treatment' rather than kept in the original row order.
res <- do.call(rbind, Map(head, split(min_set_obs, min_set_obs$Treatment), n = table(key)))
row.names(res) <- NULL
res
# Treatment seq
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
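If the original row order should be preserved, as in the output the question asks for, one possible base R sketch is to number the rows within each treatment and keep those whose running position does not exceed the count in key. The helper names counts, allowed, pos and kept are made up for this illustration.

```r
min_set_obs <- data.frame(Treatment = c(1, 1, 3, 1, 2, 1, 2),
                          seq = c(29, 23, 60, 6, 41, 5, 44))
key <- c(1, 1, 1, 2, 2, 3)

counts <- table(key)
# Number of rows each row's treatment is allowed to keep
# (NA when the treatment does not occur in key, replaced by 0 below)
allowed <- as.integer(counts)[match(as.character(min_set_obs$Treatment),
                                    names(counts))]
allowed[is.na(allowed)] <- 0L

# Running position of each row within its treatment group
pos <- ave(seq_along(min_set_obs$Treatment), min_set_obs$Treatment,
           FUN = seq_along)

kept <- min_set_obs[pos <= allowed, ]
kept
```

Here the sixth row is dropped because it is the fourth 1 while key contains only three.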
With dplyr, you can first count the keys using table and then take the corresponding number of top rows from each group:
library(dplyr)
m <- table(key)
min_set_obs %>% group_by(Treatment) %>% do({
# as.character(.$Treatment[1]) returns the treatment for the current group
# use coalesce to get the default number of rows (0) if the treatment doesn't exist in key
head(., coalesce(m[as.character(.$Treatment[1])], 0L))
})
# A tibble: 6 x 2
# Groups: Treatment [3]
# Treatment seq
# <int> <int>
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60

Perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, I have a data frame that is the result of a group_by and summarize, and I want to do some further manipulation on it by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I manually inserted the fractions in the percent column to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
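You can use total as a look-up table to get a total for each run of df. Indexing with the factor df$run uses its integer codes, so this relies on the rows of total being in the same order as the levels of run: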
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
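If the rows of total are not guaranteed to line up with the levels of run, a lookup through match is a possible alternative. The sketch below assumes, purely for illustration, that total is stored in reverse order; idx is a hypothetical helper name.

```r
df <- data.frame(run = factor(c(rep(1, 3), rep(2, 3))),
                 group = factor(rep(c("a", "b", "c"), 2)),
                 sum = c(1, 8, 34, 2, 7, 33))
# Assumption for illustration: the totals are listed as run 2, then run 1
total <- data.frame(run = factor(c(2, 1)), total = c(47, 45))

# Match on the run labels rather than on row positions
idx <- match(as.character(df$run), as.character(total$run))
df$percent <- df$sum / total$total[idx]
df
```

Matching on the labels gives the same percentages as above regardless of how total is sorted.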
If your "run" values are 1, 2, ..., n, then this will work:
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
First, merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate (from dplyr; the %<>% assignment pipe comes from magrittr):
library(dplyr)
library(magrittr)
df2 %<>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Data frame manoeuvre [duplicate]

Suppose I have the following dataframe:
Categories Variable
1 a 11
2 b 21
3 c 34
4 d 45
5 e 52
6 f 65
7 g 76
8 a 13
9 b 24
I'd like to turn it into a new dataframe like the following:
Categories Variable
1 a 11
2 b 21
3 c 34
4 d+e 97
5 f 65
6 g 76
7 a 13
8 b 24
How can I do it? (Of course, the real dataframe is much larger, but I want to sum all adjacent d and e categories and group them into a new category, say 'H').
Many thanks!
This is a good question but unfortunately off-topic here, so I'll answer until it gets migrated.
I'm assuming Categories is of class factor, so you'll need to properly re-level it (assuming your data is called df)
levels(df$Categories)[levels(df$Categories) %in% c("d", "e")] <- "h"
Next, I'll use the data.table package, as you have a large data set and its devel version (v >= 1.9.5) has a convenient function called rleid (download from GitHub)
library(data.table) ## v >= 1.9.5
setDT(df)[, .(Variable = sum(Variable)), by = .(indx = rleid(Categories), Categories)]
# indx Categories Variable
# 1: 1 a 11
# 2: 2 b 21
# 3: 3 c 34
# 4: 4 h 97
# 5: 5 f 65
# 6: 6 g 76
# 7: 7 a 13
# 8: 8 b 24
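The same run-based grouping can be sketched in base R without data.table. This is only an illustration under the assumption that Categories is stored as character; run_id and res are names made up here.

```r
df <- data.frame(Categories = c("a", "b", "c", "d", "e", "f", "g", "a", "b"),
                 Variable = c(11, 21, 34, 45, 52, 65, 76, 13, 24),
                 stringsAsFactors = FALSE)

# Recode d and e into the new category h
df$Categories[df$Categories %in% c("d", "e")] <- "h"

# Run id: increases by one whenever the category differs from the previous row
run_id <- cumsum(c(TRUE, df$Categories[-1] != df$Categories[-nrow(df)]))

# One row per run: first category label of the run and the sum over the run
res <- data.frame(Categories = df$Categories[!duplicated(run_id)],
                  Variable = as.vector(tapply(df$Variable, run_id, sum)))
res
```

Because the grouping is by consecutive runs, the later a and b rows stay separate from the first ones, just as with rleid.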
You can try this:
# plyr package provides rbind.fill() function for row binding
library(plyr)
# Assuming you have a rows.csv file containing the data, read it into a data frame
data<-read.csv("rows.csv",stringsAsFactors=FALSE)
# Find the lowest index of d or e (whichever comes first)
index<-min(match("d",data$Var1.nominal.), match("e",data$Var1.nominal.))
# Returns all rows containing d and e in Var1(nominal) column
tempData<-data[data$Var1.nominal. %in% c("d","e"),]
# Remove all the rows containing d and e from original data frame
data<-data[!data$Var1.nominal. %in% c("d","e"),]
# Reorder row index numbers in data
rownames(data)<-NULL
# Combine rows containing d and e in Var1(nominal)column, and sum up the column Var2(numeric)
tempData<-data.frame(Var1.nominal.="d+e",Var2.numeric.=sum(tempData[,2]))
# Combine original data and tempData frame with use of index
data<-rbind.fill(data[1:(index-1),],tempData,data[index:length(data[,1]),])
# Rename "d+e" to "h"
data[index,1]<-"h"
# Getting rid of the tempData data frame
rm(tempData)
Output:
> data
Var1.nominal. Var2.numeric.
1 a 11
2 b 21
3 c 34
4 h 97
5 f 65
6 g 76
7 a 13
8 b 24

Extract rows from data frame based on multiple identifiers in another data frame

I would like to extract a selection of rows from a data frame based on multiple identifying variables contained in another data frame. Consider the following illustrative data set:
df <- data.frame(id=c(1,2,2,3,4,4,4,4,5), ref=c("A","B","C","D","E","F","F","G","H"), amount=c(10,15,20,25,30,35,-35,40,45))
required <- data.frame(id=c(2,3,4,4), ref=c("B","D","E","F"))
I would like the output in a data frame with id, ref and amount as follows:
id ref amount
2 B 15
3 D 25
4 E 30
4 F 35
4 F -35
Note in particular that id 4 and ref F have two matches from the df with amounts 35 and -35.
You want to merge:
merge(df, required)
## id ref amount
## 1 2 B 15
## 2 3 D 25
## 3 4 E 30
## 4 4 F 35
## 5 4 F -35
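A slightly more explicit variant, offered as a sketch: naming the key columns in by documents which columns are matched, and wrapping required in unique guards against duplicated rows in required multiplying the matches. Whether required can contain duplicates is an assumption, not something stated in the question.

```r
df <- data.frame(id = c(1, 2, 2, 3, 4, 4, 4, 4, 5),
                 ref = c("A", "B", "C", "D", "E", "F", "F", "G", "H"),
                 amount = c(10, 15, 20, 25, 30, 35, -35, 40, 45))
required <- data.frame(id = c(2, 3, 4, 4), ref = c("B", "D", "E", "F"))

# Join on both identifier columns; unique() avoids repeated matches
# if required ever lists the same id/ref pair twice (assumed possibility)
res <- merge(df, unique(required), by = c("id", "ref"))
res
```

Both rows for id 4 and ref F are kept, since merge returns every matching row of df.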

Subset a data frame based on another

I have two data frames, x and y.
x<-data.frame(id=c(1,2,3,4,5), g=c(21,52,43,94,35))
y<-data.frame(id=c(3,4,7), u=c(55, 77, 99))
I want to subset x to include only the observations with "IDs" that are also in y.
What is the best way of doing this?
Thanks!
Use setdiff to exclude observations appearing in both df
> x[setdiff(x$id, y$id),]
id g
1 1 21
2 2 52
5 5 35
Use merge to include observations present in both df
> merge(x, y)
id g u
1 3 43 55
2 4 94 77
Or are you looking for this subset?
> x[intersect(x$id, y$id),]
id g
3 3 43
4 4 94
The accepted answer only works because the values 3 and 4 in x$id happen to be located in rows 3 and 4. The wrong answer will be obtained, for example, if:
x<-data.frame(id=c(1,3,2,4,5), g=c(21,52,43,94,35))
x[intersect(x$id, y$id),]
id g
3 2 43
4 4 94
The following will work properly, regardless of the position of the common elements:
x[is.element(x$id,intersect(x$id,y$id)),]
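An equivalent and arguably more common spelling uses %in%, which builds a logical mask over the rows of x instead of treating id values as row numbers. The sketch below uses the reordered x from this answer; in_both and only_x are names chosen for illustration.

```r
x <- data.frame(id = c(1, 3, 2, 4, 5), g = c(21, 52, 43, 94, 35))
y <- data.frame(id = c(3, 4, 7), u = c(55, 77, 99))

# Rows of x whose id also appears in y
in_both <- x[x$id %in% y$id, ]
# Rows of x whose id does not appear in y
only_x <- x[!(x$id %in% y$id), ]

in_both
only_x
```

Negating the mask gives the complement, which avoids the same positional pitfall when excluding rows.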
