How to group my string vector in two data frames? - r

Hi: I have a simple question, but it confuses me a lot. Below is my code:
a <- data.frame(url = c("1","2","3","4","5"),
id = c("a","b","c","d","e")
)
b <- data.frame(url = c("1","1","2","2","2","3","3","3","3","4","4","5","5"),
price = c(10,10,20,20,20,30,30,30,30,40,40,50,50),
recipt=c("n","n","n","n","n","n","n","n","n","y","y","n","n")
)
I want a new data frame, newdata, which merges b$recipt into a and looks like this:
>newdata
url id recipt
1 a n
2 b n
3 c n
4 d y
5 e n
Please give me some hints, thanks.

You could try this:
a$recipt <- sapply(1:nrow(a),function(x) b$recipt[b$url==a$url[x]][1])
#> a
# url id recipt
#1 1 a n
#2 2 b n
#3 3 c n
#4 4 d y
#5 5 e n
Here it is assumed that the recipt entries are the same for any given value of url in b. If this is not the case, things become more complicated.
If you want to keep a unchanged and generate a new frame newdata with the new column, then the above code can be slightly modified in a rather trivial way:
newdata <- a
newdata$recipt <- sapply(1:nrow(a),function(x) b$recipt[b$url==a$url[x]][1])
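If the assumption that recipt is constant within each url is in doubt, it can be checked before merging. What follows is only a minimal base R sketch, assuming the a and b data frames from the question; the names n_distinct_recipt and lookup are made up for this illustration.

```r
a <- data.frame(url = c("1","2","3","4","5"),
                id = c("a","b","c","d","e"))
b <- data.frame(url = c("1","1","2","2","2","3","3","3","3","4","4","5","5"),
                price = c(10,10,20,20,20,30,30,30,30,40,40,50,50),
                recipt = c("n","n","n","n","n","n","n","n","n","y","y","n","n"))

# Count the distinct recipt values per url; every count should be 1
n_distinct_recipt <- tapply(b$recipt, b$url, function(x) length(unique(x)))
stopifnot(all(n_distinct_recipt == 1))

# Reduce b to one row per url (hypothetical helper table), then merge onto a
lookup <- unique(b[c("url", "recipt")])
newdata <- merge(a, lookup, by = "url", all.x = TRUE)
newdata
```

If the stopifnot() check fails, some url carries more than one recipt value, and you would have to decide which one to keep before merging.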

You could use match
transform(a, recipt= b$recipt[match(url, b$url)])
# url id recipt
#1 1 a n
#2 2 b n
#3 3 c n
#4 4 d y
#5 5 e n
Or using the devel version of data.table (the on= argument needs v1.9.5 or later; see the data.table installation instructions for the devel version)
library(data.table)#v1.9.5+
setDT(a)[unique(b[c(1,3)], by='url'), on='url']
# url id recipt
#1: 1 a n
#2: 2 b n
#3: 3 c n
#4: 4 d y
#5: 5 e n

So I think what you want is to merge a onto b as there are multiple prices with the same url. Thus b is your base data frame and you want to append an id value to it. Some of the id values would be repeated.
One easy way is to do this with dplyr.
library(dplyr)
a <- data.frame(url = c("1","2","3","4","5"),
id = c("a","b","c","d","e")
)
b <- data.frame(url = c("1","1","2","2","2","3","3","3","3","4","4","5","5"),
price = c(10,10,20,20,20,30,30,30,30,40,40,50,50),
recipt=c("n","n","n","n","n","n","n","n","n","y","y","n","n")
)
left_join(b, a, by = "url")
url price recipt id
1 1 10 n a
2 1 10 n a
3 2 20 n b
4 2 20 n b
5 2 20 n b
6 3 30 n c
7 3 30 n c
8 3 30 n c
9 3 30 n c
10 4 40 y d
11 4 40 y d
12 5 50 n e
13 5 50 n e

Related

Which group meet the criterion a < b < c depending on condition

My title might not be very informative but this is an example which exposes my problem :
I have this dataframe :
df=data.frame(cond1=c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3),
group=c("F","V","M","F","V","M","F","V","M","F","V","M","F","V","M","F","V","M"),
gene=c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B"),
value=c(1,2,3,4,5,6,7,8,9,1,3,2,4,3,2,2,3,4))
df
cond1 group gene value
1 1 F A 1
2 1 V A 2
3 1 M A 3
4 2 F A 4
5 2 V A 5
6 2 M A 6
7 3 F A 7
8 3 V A 8
9 3 M A 9
10 1 F B 1
11 1 V B 3
12 1 M B 2
13 2 F B 4
14 2 V B 3
15 2 M B 2
16 3 F B 2
17 3 V B 3
18 3 M B 4
What I would like to obtain, for each gene, is the number of distinct cond1 values for which the value for group F is smaller than the value for group V, which in turn is smaller than the value for group M.
In the first 3 lines, we are in gene A with cond1=1. The values corresponding to the groups are F=1, V=2, M=3, so F<V<M for gene A in the cond1=1 group.
My expected output for gene A is 3, as all cond1 groups meet F<V<M for value.
My expected output for gene B is 1, as only the cond1=3 group meets F<V<M for value.
My desired output would ideally be a data frame with gene and the count of cond1 groups that meet my criterion:
gene count
1 A 3
2 B 1
I would be very grateful if you could provide me any tips on how should I proceed
Check if all the data is in increasing order and count how many such values exist for each gene.
library(dplyr)
df %>%
#If the data is not ordered, order it using arrange
#arrange(gene, cond1, match(group, c('F', 'V', 'M'))) %>%
group_by(gene, cond1) %>%
summarise(cond = all(diff(value) > 0)) %>%
summarise(count = sum(cond))
# gene count
# <chr> <int>
#1 A 3
#2 B 1
Using data.table
library(data.table)
setDT(df)[, .(cond = all(diff(value) > 0)), .(gene, cond1)][, .(count = sum(cond)), gene]
gene count
1: A 3
2: B 1
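For comparison, the same idea (check that value strictly increases in F, V, M order, then count per gene) can be sketched in base R. This assumes the df from the question; the intermediate names increasing and result are made up for the example.

```r
df <- data.frame(cond1 = rep(rep(1:3, each = 3), 2),
                 group = rep(c("F", "V", "M"), 6),
                 gene  = rep(c("A", "B"), each = 9),
                 value = c(1,2,3,4,5,6,7,8,9,1,3,2,4,3,2,2,3,4))

# Put rows in F, V, M order within each gene/cond1 so diff() tests F < V < M
df <- df[order(df$gene, df$cond1, match(df$group, c("F", "V", "M"))), ]

# TRUE for each gene/cond1 combination whose values strictly increase
increasing <- aggregate(value ~ gene + cond1, data = df,
                        FUN = function(v) all(diff(v) > 0))

# Count the TRUEs per gene
result <- aggregate(value ~ gene, data = increasing, FUN = sum)
names(result)[2] <- "count"
result
```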

Simplest way to replace a list of values in a data frame with a list of new values

Say we have a data frame with a factor (Group) that is a grouping variable for a list of IDs:
set.seed(123)
data <- data.frame(Group = factor(sample(5,10, replace = T)),
ID = c(1:10))
In this example, the ID's belong to one of 5 Groups, labeled 1:5. We simply want to replace 1:5 with A:E. In other words, if Group == 1, we want to change it to A, if Group == 2, we want to change it to B, and so on. What is the simplest way to achieve this?
You may assign new labels= as a named list by calling factor once again.
data$Group1 <- factor(data$Group, labels=list("1"="A", "2"="B", "3"="C", "4"="D", "5"="E"))
## more succinct:
data$Group2 <- factor(data$Group, labels=setNames(list("A", "B", "C", "D", "E"), 1:5))
data
# Group ID Group1 Group2
# 1 3 1 C C
# 2 3 2 C C
# 3 2 3 B B
# 4 2 4 B B
# 5 3 5 C C
# 6 5 6 E E
# 7 4 7 D D
# 8 1 8 A A
# 9 2 9 B B
# 10 3 10 C C
This is the general approach; if capital letters are indeed wanted, see @RonakShah's solution.
You can use R's built-in constant LETTERS:
data$new_group <- LETTERS[data$Group]
data
# Group ID new_group
#1 3 1 C
#2 3 2 C
#3 2 3 B
#4 2 4 B
#5 3 5 C
#6 5 6 E
#7 4 7 D
#8 1 8 A
#9 2 9 B
#10 3 10 C
Created a new column (new_group) here for comparison purposes. You can overwrite the same column if you wish to.
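One caveat worth illustrating: LETTERS[data$Group] indexes by the factor's integer codes, which coincide with the labels here only because the levels are exactly 1 to 5. A more defensive sketch, assuming the same data, looks the new value up by the level label instead. The lookup vector is a hypothetical mapping you would write yourself.

```r
set.seed(123)
data <- data.frame(Group = factor(sample(5, 10, replace = TRUE)),
                   ID = 1:10)

# Hypothetical lookup: names are the old values, elements are the new ones
lookup <- c("1" = "A", "2" = "B", "3" = "C", "4" = "D", "5" = "E")

# Index by the level labels (as character), not by the integer codes
data$new_group <- unname(lookup[as.character(data$Group)])
data
```

This keeps working if some groups are missing from the data or if the old labels are not consecutive integers, since the match is done by name.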

Get first and last value from groups using rle

I want to get first and last value for groups using grouping similar to what rle() function does.
For example I have this data frame:
> df
df time
1 1 A
2 1 B
3 1 C
4 1 D
5 2 E
6 2 F
7 2 G
8 1 H
9 1 I
10 1 J
11 3 K
12 3 L
13 3 M
14 2 N
15 2 O
16 2 P
I want to get something like this:
> want
df first last
1 1 A D
2 2 E G
3 1 H J
4 3 K M
5 2 N P
As you can see, I want to group my values the way the rle() function does: elements should be grouped only when the same value appears in consecutive rows. group_by groups elements in a different way.
> rle(df$df)
Run Length Encoding
lengths: int [1:5] 4 3 3 3 3
values : num [1:5] 1 2 1 3 2
Is there a solution for my problem? Any advice will be appreciated.
There is a function rleid from data.table that does that job, i.e.
library(data.table)
setDT(df)[, .(df = head(df, 1),
first = head(time, 1),
last = tail(time, 1)),
by = .(grp = rleid(df))][, grp := NULL][]
Which gives,
df first last
1: 1 A D
2: 2 E G
3: 1 H J
4: 3 K M
5: 2 N P
Adding a dplyr approach, as @RonakShah mentions
library(dplyr)
df %>%
group_by(grp = cumsum(c(0, diff(df)) != 0)) %>%
summarise(df = first(df),
first = first(time),
last = last(time)) %>%
select(-grp)
Giving,
# A tibble: 5 x 3
df first last
<int> <chr> <chr>
1 1 A D
2 2 E G
3 1 H J
4 3 K M
5 2 N P
Here is an option using base R with rle. Run rle on the first column, replicate the sequence of run indices by the run lengths, use duplicated on that vector to build logical indices for the first and last row of each run, and then subset the original dataset with those indices.
rl <- rle(df[,1])
i1 <- rep(seq_along(rl$values), rl$lengths)
i2 <- !duplicated(i1)
i3 <- !duplicated(i1, fromLast = TRUE)
wanted <- data.frame(df = df[i2,1], first = df[i2,2], last = df[i3,2])
wanted
# df first last
#1 1 A D
#2 2 E G
#3 1 H J
#4 3 K M
#5 2 N P
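The run boundaries can also be read directly off the rle result, since the cumulative sum of the run lengths gives the row where each run ends. A small sketch, assuming the df from the question (rebuilt here so the snippet runs on its own); first_idx and last_idx are names chosen for the example.

```r
df <- data.frame(df = c(1,1,1,1,2,2,2,1,1,1,3,3,3,2,2,2),
                 time = LETTERS[1:16],
                 stringsAsFactors = FALSE)

rl <- rle(df$df)
last_idx  <- cumsum(rl$lengths)          # row where each run ends
first_idx <- last_idx - rl$lengths + 1   # row where each run starts

want <- data.frame(df = rl$values,
                   first = df$time[first_idx],
                   last = df$time[last_idx],
                   stringsAsFactors = FALSE)
want
```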

Regroup, summarise and combine variables

I've been breaking my head to understand how to do this but so far I couldn't find an easy solution.
I have the following dataset:
Itin Origin Destination Passengers
1 A B 1
1 B C 1
2 A B 3
3 E B 10
4 A C 2
5 E B 4
What I'm trying to do is based on the Itin variable, to create a path variable, while keeping the passengers variable.
The easiest way to understand this is to think of a flight with a layover somewhere. For example, in Itin = 1 one passenger goes from A to B to C. The only things that have to be kept are the Origin A, the Destinations B and C, and the passengers value as it is, which is equal to 1, just like in the example below.
Path Passengers
A-B-C 1
A-B 3
E-B 10
A-C 2
E-B 4
I've tried several options with group_by with dplyr, as it is often quicker than the base options, but I couldn't really get the result as on the second example with a new variable Path. I thought as well to use tidyr but I'm not really sure how it could help here.
Any idea on how to do this?
Edit: As for the Path variable, it doesn't really matter if ends up as A-B-C, or A,B,C or A B C as I will only look at the syntax.
EDIT A faster solution using data.table
df1<-read.table(text="Itin Origin Destination Passengers
1 A B 1
1 B C 1
2 A B 3
3 E B 10
4 A C 2
5 E B 4",header=TRUE, stringsAsFactors=FALSE)
library(data.table)
DT <-data.table(df1)
DT[,.(Passengers, Path = paste(Origin[1],paste(Destination, collapse = " "),
collapse = " ")), by=Itin]
Itin Passengers Path
1: 1 1 A B C
2: 1 1 A B C
3: 2 3 A B
4: 3 10 E B
5: 4 2 A C
6: 5 4 E B
Here's my original solution with dplyr:
df1<-read.table(text="Itin Origin Destination Passengers
1 A B 1
1 B C 1
2 A B 3
3 E B 10
4 A C 2
5 E B 4",header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
group_by(Itin) %>%
summarise(Passengers=max(Passengers),
Path = paste(Origin[1],paste(Destination, collapse = " "),
collapse = " "))
# A tibble: 5 × 3
Itin Passengers Path
<int> <int> <chr>
1 1 1 A B C
2 2 3 A B
3 3 10 E B
4 4 2 A C
5 5 4 E B
Reading data:
read.table(textConnection("Itin Origin Destination Passengers
1 A B 1
1 B C 1
2 A B 3
3 E B 10
4 A C 2
5 E B 4"), header=T, stringsAsFactors=F) -> df
Using base R in this case:
Path <- lapply(unique(df$Itin), function(it) {
x <- subset(df, Itin==it)
c(x$Origin[1], x$Destination)
})
new_df <- unique(df[,c("Itin", "Passengers")])
new_df$Path <- Path
> new_df
Itin Passengers Path
1 1 1 A, B, C
3 2 3 A, B
4 3 10 E, B
5 4 2 A, C
6 5 4 E, B
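If a plain character Path column is preferred over a list column, the same base R idea can be sketched with split and paste. This assumes the Passengers value is constant within each Itin (as in the example data), and result is just an illustrative name.

```r
df <- read.table(text = "Itin Origin Destination Passengers
1 A B 1
1 B C 1
2 A B 3
3 E B 10
4 A C 2
5 E B 4", header = TRUE, stringsAsFactors = FALSE)

# One path string per itinerary: the first origin, then every destination
paths <- sapply(split(df, df$Itin), function(x) {
  paste(c(x$Origin[1], x$Destination), collapse = "-")
})

# One Passengers value per itinerary (assumed constant within Itin)
result <- data.frame(Itin = as.integer(names(paths)),
                     Path = unname(paths),
                     Passengers = as.vector(tapply(df$Passengers, df$Itin, max)),
                     stringsAsFactors = FALSE)
result
```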

Assigning rows of data.frame to another data.frame in R based on frequency of element's occurrence

I have a data.frame df
> df
V1 V2
1 a b
2 a e
3 a f
4 b c
5 b e
6 b f
7 c d
8 c g
9 c h
10 d g
11 d h
12 e f
13 f g
14 g h
I found the frequency of each element's occurrence considering column V1 and sorted the Freq column in ascending order
>dfFreq <- as.data.frame(table(df$V1))
Var1 Freq
1 a 3
2 b 3
3 c 3
4 d 2
5 e 1
6 f 1
7 g 1
>dfFreqSorted <- dfFreq[order(dfFreq$Freq),]
Var1 Freq
5 e 1
6 f 1
7 g 1
4 d 2
1 a 3
2 b 3
3 c 3
Now what I want to do is create a new data.frame from the original data.frame such that each "Var1" item in "dfFreqSorted" is used as many times as its Freq, but only once per pass going from the top of "dfFreqSorted" to the bottom, which would give the result below:
So consider the first Var1 item, which is "e": the first matching row for "e" in the V1 column of df is (e,f), which would be the first row in the new data.frame.
I figured that this can be done using:
>subset(df, V1==dfFreqSorted$Var[1])[1,]
V1 V2
12 e f
So if I used a for loop over all the elements in the Var1 column of dfFreqSorted, applied the subset command above, and rbind-ed each returned result into another data.frame, I would get something like this:
V1 V2
12 e f
13 f g
14 g h
10 d g
1 a b
4 b c
7 c d
This result shows each Var1 item once. I also need the remaining rows: after the first pass over all Var1 items, the loop should go back to the beginning, consider every Var1 item whose frequency is more than 1, and fetch the next unused row from df for that item. The remaining rows, appended to the same data.frame, should be:
11 d h
2 a e
5 b e
8 c g
3 a f
6 b f
9 c h
As you can see above, in the first pass the elements of Var1 whose frequency is 1 are used first, followed by one row for each element whose frequency is greater than 1. In the second pass, the elements with a frequency of at least 2 are used again, and in the third pass those with a frequency of at least 3. Each time, the next unused row for that element is fetched from df.
In short, all rows of df are arranged in a new data.frame such that elements are used in ascending order of their frequencies, once per pass, with elements of frequency two or three appearing again in the second or third pass.
I am not asking for the whole code, just a few guidelines on how I can achieve this. Thanks in advance.
Hello @akrun, I am a beginner, so this solution might be a beginner-level approach, but it solved my problem perfectly fine.
> a<-read.table("isnodes.txt")
> a
V1 V2
1 a b
2 a e
3 a f
4 b c
5 b e
6 b f
7 c d
8 c g
9 c h
10 d g
11 d h
12 e f
13 f g
14 g h
> aF<-as.data.frame(table(a$V1))
> aF
Var1 Freq
1 a 3
2 b 3
3 c 3
4 d 2
5 e 1
6 f 1
7 g 1
> aFsorted <- aF[order(aF$Freq),]
> aFsorted
Var1 Freq
5 e 1
6 f 1
7 g 1
4 d 2
1 a 3
2 b 3
3 c 3
> sortedEdgeList <- a[-c(1:nrow(a)),]
> sortedEdgeList
[1] V1 V2
<0 rows> (or 0-length row.names)
> aFsorted <- cbind(aFsorted, Used=0)
> aFsorted
Var1 Freq Used
5 e 1 0
6 f 1 0
7 g 1 0
4 d 2 0
1 a 3 0
2 b 3 0
3 c 3 0
> maxFreq <- max(aFsorted$Freq)
> maxFreq
[1] 3
> for(i in 1:maxFreq){
+ rows<-nrow(aFsorted)
+ for(j in 1:rows){
+ Var1Value<-aFsorted$Var[j]
+ Var1Edge<-a[match(aFsorted$Var1[j],a$V1),]
+ sortedEdgeList<-rbind(sortedEdgeList,Var1Edge)
+ a<-a[!(a$V1==Var1Edge$V1 & a$V2==Var1Edge$V2),]
+ aFsorted$Used[j]=aFsorted$Used[j]+1
+ }
+ if(aFsorted$Used==aFsorted$Freq){
+ aFsorted<-aFsorted[!(aFsorted$Used==aFsorted$Freq),]
+ }
+ }
Warning messages:
1: In if (aFsorted$Used == aFsorted$Freq) { :
the condition has length > 1 and only the first element will be used
2: In if (aFsorted$Used == aFsorted$Freq) { :
the condition has length > 1 and only the first element will be used
3: In if (aFsorted$Used == aFsorted$Freq) { :
the condition has length > 1 and only the first element will be used
> sortedEdgeList
V1 V2
12 e f
13 f g
14 g h
10 d g
5 a b
4 b c
7 c d
11 d h
2 a e
51 b e
8 c g
3 a f
6 b f
9 c h
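The same round-robin ordering can probably be obtained without a loop. The sketch below is not the code from the transcript above but a vectorized alternative, assuming the original df from the question: it numbers each row by how many times its V1 value has already appeared (pass), and then sorts by pass, by frequency, and by V1. The names freq, pass and sorted are made up for the example.

```r
df <- data.frame(V1 = c("a","a","a","b","b","b","c","c","c","d","d","e","f","g"),
                 V2 = c("b","e","f","c","e","f","d","g","h","g","h","f","g","h"),
                 stringsAsFactors = FALSE)

idx  <- seq_along(df$V1)
freq <- ave(idx, df$V1, FUN = length)     # how often the row's V1 value occurs
pass <- ave(idx, df$V1, FUN = seq_along)  # 1st, 2nd, 3rd use of that V1 value

# Within each pass, low-frequency elements come first; ties follow V1 order
sorted <- df[order(pass, freq, df$V1), ]
sorted
```

The original row names are kept, so the result can be compared against the expected rows listed in the question (12, 13, 14, 10, 1, 4, 7, then 11, 2, 5, 8, then 3, 6, 9).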
I'm not sure this is what you want, but it might be close. It helps conceptually to keep the frequencies in the original data frame.
library("plyr")
set.seed(3)
df <- data.frame(V1 = sample(letters[1:10], 20, replace = TRUE),
V2 = sample(letters[1:10], 20, replace = TRUE),
stringsAsFactors = FALSE)
df$freqV1 <- NA_integer_
for (i in 1:nrow(df)) {
df$freqV1[i] <- length(grep(pattern = df$V1[i], x = df$V1))
}
df2 <- arrange(df, freqV1, V2) # you may want just arrange(df, freqV1)
which gives:
V1 V2 freqV1
1 h c 1
2 d a 2
3 d b 2
4 c c 2
5 c j 2
6 b c 3
7 g c 3
8 b f 3
9 g h 3
10 g h 3
11 b i 3
12 i a 4
13 i c 4
14 i d 4
15 i f 4
16 f b 5
17 f d 5
18 f d 5
19 f e 5
20 f f 5

Resources