Doing a series of operations on every subset of the data obtained from a dataframe - r

This is a question from a newcomer to R. I searched and found quite a few solutions that came close (e.g. aggregate, by, etc.), but I lacked the understanding to apply them to my problem. I would really appreciate it if someone could guide me in a more detailed way.
Hypothetical Dataset
Name Wheels Color Mileage seat_capacity
1 2 Red 70 2
2 3 Black 60 7
3 4 Blue 12 5
4 4 White 15 6
5 3 Yellow 45 6
6 2 Green 70 2
7 3 Silver 45 6
8 6 Silver 5 4
9 14 Red 12 2
10 2 Black 70 7
11 4 Blue 70 5
12 3 White 60 6
13 4 Yellow 12 6
14 4 Green 15 2
I have initially created subsets of the data based on color using split.
color <- split(df, df$Color)
For each of the subsets created I want to run more operations, e.g. finding the vehicles with the highest mileage among the vehicles with the lowest number of wheels in each subset, etc.
I have already written the rules for that second part. I am struggling to find a way to run all the operations on each of the subsets in the variable color.
Any help would be appreciated.

The following worked for me, and I sincerely want to thank #Imo and #aosmith for guiding me.
Suppose I want to group df by color first, then by wheels, and then pick the top 2 vehicles by Mileage within each such subgroup. I used the dplyr library for this.
library(dplyr)
my_list <- df %>% group_by(Color, Wheels) %>% top_n(2, Mileage)
HTH
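For the rule described in the question (highest mileage among the vehicles with the fewest wheels, per color), a base R sketch that keeps the split() approach could look like the following. The data frame is retyped from the hypothetical dataset above, and the names by_color, best_per_color and result are made up for illustration; ties are kept rather than broken.

```r
# Hypothetical dataset from the question, retyped as a data frame
df <- data.frame(
  Name = 1:14,
  Wheels = c(2, 3, 4, 4, 3, 2, 3, 6, 14, 2, 4, 3, 4, 4),
  Color = c("Red", "Black", "Blue", "White", "Yellow", "Green", "Silver",
            "Silver", "Red", "Black", "Blue", "White", "Yellow", "Green"),
  Mileage = c(70, 60, 12, 15, 45, 70, 45, 5, 12, 70, 70, 60, 12, 15),
  seat_capacity = c(2, 7, 5, 6, 6, 2, 6, 4, 2, 7, 5, 6, 6, 2),
  stringsAsFactors = FALSE
)

# One data frame per color
by_color <- split(df, df$Color)

# Apply the same rule to every subset
best_per_color <- lapply(by_color, function(d) {
  # keep the vehicles with the fewest wheels in this color
  fewest <- d[d$Wheels == min(d$Wheels), , drop = FALSE]
  # among those, keep the vehicle(s) with the highest mileage
  fewest[fewest$Mileage == max(fewest$Mileage), , drop = FALSE]
})

# Stack the per-color results back into a single data frame
result <- do.call(rbind, best_per_color)
print(result)
```

Any further rule can be added inside the function passed to lapply(), since d is just an ordinary data frame for one color.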

Related

Is there an R function to help me plot the network connections for a single node?

This is my original dataset. R1,R2 and R3 are word association responses for the cue word. tf and df are total and document frequency of the cue word, respectively.
[1]: https://i.stack.imgur.com/wpfZy.png [Image shows original dataframe]
I have cleaned up a dataset into a nodes list and an edge list. I have over a million rows in both lists. Plotting this as a network graph would take too long, and also be very dense, i.e. not understandable.
[2]: https://i.stack.imgur.com/mfSfN.png [Image shows node-list]
[3]: https://i.stack.imgur.com/l60Eu.png [Image shows edge-list]
I want to be able to make a network graph for the cue words, such that upon entering a cue word, I get a network of words that are either responses to it, or are words that the cue word is a response for.
For example, I want to see all the connections for the word 'money'. Using filter(nword == "money") only shows the node 'money' as an output, but I want all nodes connected to the cue word (in this case, 'money').
[4]: https://i.stack.imgur.com/1bKrr.png [Image shows filter()]
Is there a function or a chunk of code that would help me resolve this issue?
Edge list (first rows):
from to
1 1
1 6
1 8
1 17
1 18
1 22
1 23
1 38
1 67
1 80
2 82736
2 88035
2 103428
3 11
3 27
3 45
Node list (first rows):
node_id nword n
1 money 13633
2 food 12338
3 water 12276
4 car 8907
5 music 8351
6 green 7890
7 red 7623
8 love 7406
9 sex 6552
10 happy 6432
11 cold 6333
12 bad 6132
13 sad 5958
14 dog 5940
15 white 5910
16 school 5832
17 fun 5594
18 time 5467
19 black 5233
20 hair 5219
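One possible way to get everything connected to a cue word is to look up its node_id and then keep every edge where that id appears in either from or to. The sketch below uses base R on a tiny made-up node and edge list; the toy data and the helper name ego_edges are hypothetical, and only the column names (node_id, nword, from, to) come from the tables above.

```r
# Toy node and edge lists with the same column names as in the question
nodes <- data.frame(
  node_id = 1:5,
  nword = c("money", "food", "water", "car", "music"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c(1, 1, 2, 3, 4),
  to   = c(2, 3, 3, 1, 5)
)

# Keep every edge that starts or ends at the cue word (hypothetical helper)
ego_edges <- function(cue, nodes, edges) {
  id <- nodes$node_id[nodes$nword == cue]
  edges[edges$from %in% id | edges$to %in% id, , drop = FALSE]
}

money_edges <- ego_edges("money", nodes, edges)

# Nodes that appear in at least one of those edges
money_nodes <- nodes[nodes$node_id %in% c(money_edges$from, money_edges$to), ]

print(money_edges)
print(money_nodes)
```

If the igraph package is installed, the two filtered data frames could then be passed to igraph::graph_from_data_frame(money_edges, vertices = money_nodes) and plotted, which keeps the graph small enough to read.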

Optimal binning for numerical data using R

I have a data frame that looks like this
data link: https://1drv.ms/t/s!ArOzUuixE-mg6W7zY2Xvgu80dCsL?e=BuP6xM
letters counts
1 AAAAAA 21
2 AAAAAAAA 9
3 AAAAAAAACAAGGA 1
4 AAAAAAAAGAGT 1
5 AAAAAAACA 24
6 AAAAAAACACAAG 1
7 AAAAAAACAGGG 41
8 AAAAAAACAGTCAATCCTA 2
9 AAAAAAAG 48
10 AAAAAAAGCTGT 2
I have millions of rows like this. I have tried the package "smbinning", but I am not sure how it can be applied to this type of data.
Do you know of any other package, or how smbinning might work here?
Thanks for your time
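As a point of comparison, here is a minimal base R sketch that bins the counts column into equal-frequency (quantile) bins. This is an assumption about what kind of binning is wanted: it is plain unsupervised binning, not the supervised "optimal" binning that smbinning performs, which as far as I understand needs a binary target column that this data does not have. The sample rows are copied from the question; the names breaks and bin are made up.

```r
# First ten rows of the example data
dat <- data.frame(
  letters = c("AAAAAA", "AAAAAAAA", "AAAAAAAACAAGGA", "AAAAAAAAGAGT",
              "AAAAAAACA", "AAAAAAACACAAG", "AAAAAAACAGGG",
              "AAAAAAACAGTCAATCCTA", "AAAAAAAG", "AAAAAAAGCTGT"),
  counts = c(21, 9, 1, 1, 24, 1, 41, 2, 48, 2),
  stringsAsFactors = FALSE
)

# Quartile boundaries; unique() guards against repeated break values,
# which cut() would otherwise reject on heavily tied data
breaks <- unique(quantile(dat$counts, probs = seq(0, 1, by = 0.25)))

# Assign each row to a bin; include.lowest keeps the minimum in the first bin
dat$bin <- cut(dat$counts, breaks = breaks, include.lowest = TRUE)

print(table(dat$bin))
```

With millions of rows the same two calls should still apply, since quantile() and cut() are both vectorised.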

Frequency distribution using binCounts

I have a dataset of customer ages and I want to make a frequency distribution using age bins that are 9 years wide.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to the table shared below; the variable names can differ (as you wish).
Could I use binCounts for this? If yes, could you help me with the code, as I am not sure what bx and idxs mean in this call?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know about binCounts or even the package it is in, but here is a base R solution:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit=c(37,46,55,64,73,82,91,100)# bin boundaries; each bin is (lower, upper]
Labels=paste(head(lowerlimit,-1)+1,lowerlimit[-1],sep="-")# add one to the lower bound to get 38, 47, etc.
group=cut(Ages,lowerlimit,Labels)# determine which group each age belongs to
tab=table(group)# form a frequency table
as.data.frame(tab)# transform the table into a data frame
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))
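On the bx and idxs part of the question: going by the signature quoted above, bx appears to be the vector of bin boundaries and idxs an optional subset of elements to count, with right = FALSE meaning each bin is closed on the left, i.e. [bx[i], bx[i+1]). That reading is an assumption on my part. The sketch below reproduces the same left-closed counting in base R with findInterval() and tabulate(), and only calls binCounts if the matrixStats package (where I believe it lives) happens to be installed.

```r
Ages <- c(83, 51, 66, 61, 82, 65, 54, 56, 92, 60, 65, 87, 68, 64, 51,
          70, 75, 66, 74, 68, 44, 55, 78, 69, 98, 67, 82, 77, 79, 62, 38, 88, 76, 99,
          84, 47, 60, 42, 66, 74, 91, 71, 83, 80, 68, 65, 51, 56, 73, 55)

# Bin boundaries: 38, 47, ..., 101, giving bins [38,47), [47,56), ..., [92,101)
bx <- seq(38, 101, by = 9)

# findInterval() returns i such that bx[i] <= x < bx[i + 1];
# tabulate() then counts how many ages fall in each bin
counts <- tabulate(findInterval(Ages, bx), nbins = length(bx) - 1)

# Labels in the 38-46, 47-55, ... style of the desired table
freq <- data.frame(
  Age = paste(head(bx, -1), bx[-1] - 1, sep = "-"),
  Count = counts
)
print(freq)

# Optional cross-check, only run when matrixStats is available
if (requireNamespace("matrixStats", quietly = TRUE)) {
  print(matrixStats::binCounts(Ages, bx = bx, right = FALSE))
}
```

Because the ages are whole numbers, the left-closed bins [38,47) here hold the same values as the right-closed bins (37,46] used with cut() above.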

Adding all values of a variable in R [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 7 years ago.
I don't know how to word the title exactly, so I will just do my best to explain below... Sorry in advance for the .csv format.
I have the following example dataset:
print(data)
ID Tag Flowers
1 1 6871 1
2 2 6750 1
3 3 6859 1
4 4 6767 1
5 5 6747 1
6 6 6261 1
7 7 6750 1
8 8 6767 1
9 9 6812 1
10 10 6746 1
11 11 6496 4
12 12 6497 1
13 13 6495 4
14 14 6481 1
15 15 6485 1
Notice that in Lines 2 and 7, the tag 6750 appears twice. I observed one flower on plant number 6750 on two separate days, equaling two flowers in its lifetime. Basically, I want to add every flower that occurs for tag 6750, tag 6767, etc throughout ~100 rows. Each tag appears more than once, usually around 4 or 5 times.
I feel like I need to apply the unlist function here, but I'm a little bit lost as to how I should do so.
Without any extra packages, you can use function aggregate():
res<-aggregate(data$Flowers, list(data$Tag), sum)
This calculates a sum of the values in Flowers column for every value in the Tag column.
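The same call can also be written with aggregate()'s formula interface, which keeps the original column names in the result. A small sketch, with the example data retyped from the question:

```r
data <- data.frame(
  ID = 1:15,
  Tag = c(6871, 6750, 6859, 6767, 6747, 6261, 6750, 6767,
          6812, 6746, 6496, 6497, 6495, 6481, 6485),
  Flowers = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 4, 1, 1)
)

# Sum Flowers within each Tag; the result has one row per distinct Tag
res <- aggregate(Flowers ~ Tag, data = data, FUN = sum)

# Tags 6750 and 6767 each appear twice, so their totals are 2
print(res[res$Tag %in% c(6750, 6767), ])
```

The result columns are named Tag and Flowers, rather than the Group.1 and x that the list-based call produces.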

making a new dataframe by looking for keywords in specific variable

I have a big dataset of about 35000 cases x 32 variables.
One of those variables is Description, in which a description of the status is given, for example: patient suffered ischemic stroke.
Now I would like to make a dataframe containing all cases in which the word "stroke", "STROKE" or "Stroke" is found in the variable Description.
Could anyone suggest an efficient way to do this? Right now I just add all the rows by hand, which is very inefficient:
df1<-rbind(df[1,],df[2,],df[3,])  # ...and so on for every matching row
It works but it's unbelievably inelegant and prone to mistakes.
Here I create some example data to work with.
a <- c(1:10)
b <- c(11:20)
description <- c("Stroke","ALS","Parkinsons","STROKE","STROKE","stroke","Alzheimers","Stroke","ALS","Parkinsons")
df<-data.frame(a,b,description)
df
a b description
1 1 11 Stroke
2 2 12 ALS
3 3 13 Parkinsons
4 4 14 STROKE
5 5 15 STROKE
6 6 16 stroke
7 7 17 Alzheimers
8 8 18 Stroke
9 9 19 ALS
10 10 20 Parkinsons
With this code you can remove every case (row) that is not associated with "Stroke", "STROKE" or "stroke":
df1<-df[!(df$description!="STROKE" & df$description!="Stroke" & df$description!="stroke"),]
df1
a b description
1 1 11 Stroke
4 4 14 STROKE
5 5 15 STROKE
6 6 16 stroke
8 8 18 Stroke
Hope this was what you were looking for.
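The comparison above only matches rows whose description is exactly one of the three spellings. If the word sits inside a longer description, as in the "patient suffered ischemic stroke" example from the question, a case-insensitive pattern match with grepl() may be closer to what is needed. A minimal sketch on made-up rows (the id column and the description texts other than that example are hypothetical):

```r
df <- data.frame(
  id = 1:5,
  description = c("patient suffered ischemic stroke", "ALS",
                  "STROKE, left hemisphere", "Parkinsons", "Stroke"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose description contains "stroke" in any letter case
has_stroke <- grepl("stroke", df$description, ignore.case = TRUE)

# Keep only the matching rows
df1 <- df[has_stroke, ]
print(df1)
```

Note that a plain substring match would also catch words such as "strokes" or "heatstroke"; wrapping the pattern in word boundaries, "\\bstroke\\b", is one way to restrict it to the whole word.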
