Making a new data frame by looking for keywords in a specific variable - R

I have a big dataset of about 35,000 cases × 32 variables.
One of those variables is Description, which contains a description of the patient's status, for example: "patient suffered ischemic stroke".
Now I would like to make a data frame containing all cases in which the word "stroke", "STROKE" or "Stroke" is found in the variable Description.
Could anyone suggest an efficient way to do this? Right now I just add every case by hand, which is very inefficient:
df1 <- rbind(df[1,], df[2,], df[3,])
It works but it's unbelievably inelegant and prone to mistakes.

Here I create some example data to work with.
a <- c(1:10)
b <- c(11:20)
description <- c("Stroke","ALS","Parkinsons","STROKE","STROKE","stroke","Alzheimers","Stroke","ALS","Parkinsons")
df<-data.frame(a,b,description)
df
a b description
1 1 11 Stroke
2 2 12 ALS
3 3 13 Parkinsons
4 4 14 STROKE
5 5 15 STROKE
6 6 16 stroke
7 7 17 Alzheimers
8 8 18 Stroke
9 9 19 ALS
10 10 20 Parkinsons
With this code you can remove every case (row) that is not associated with "Stroke", "STROKE" or "stroke":
df1<-df[!(df$description!="STROKE" & df$description!="Stroke" & df$description!="stroke"),]
df1
a b description
1 1 11 Stroke
4 4 14 STROKE
5 5 15 STROKE
6 6 16 stroke
8 8 18 Stroke
Hope this was what you were looking for.
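Since in the real data the word appears inside longer descriptions such as "patient suffered ischemic stroke", an exact comparison will miss those rows. A sketch with a case-insensitive pattern match instead (grepl is base R; the column name description is taken from the example above):
# keep every row whose description contains "stroke", in any capitalisation
stroke_cases <- df[grepl("stroke", df$description, ignore.case = TRUE), ]
stroke_cases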

Related

Frequency distribution using binCounts

I have a dataset of customer ages and I want to make a frequency distribution with 9-year age gaps.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to the table shared below; the variable names can differ (as you wish).
Could I use binCounts for this? If yes, could you help me with the code, as I am not sure what bx and idxs are in:
binCounts(x, idxs = NULL, bx, right = FALSE)
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know about binCounts or even the package it is in, but here is a base R one-liner:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit = c(37, 46, 55, 64, 73, 82, 91, 100)
Labels = paste(head(lowerlimit, -1) + 1, lowerlimit[-1], sep = "-")  # add one to get 38, 47, etc.
group = cut(Ages, lowerlimit, labels = Labels)  # determine which group each age belongs to
tab = table(group)  # form a frequency table
as.data.frame(tab)  # transform the table into a data frame
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))
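If you specifically want to use binCounts, it appears to come from the matrixStats package (the signature quoted in the question matches it). Assuming that, a sketch along the same lines; bx holds the bin edges and, with right = FALSE, each bin is [lower, upper), so whole-number ages 38-46 land in the first bin:
library(matrixStats)
bx <- seq(38, 101, by = 9)  # bin edges 38, 47, ..., 101
counts <- binCounts(Ages, bx = bx, right = FALSE)
data.frame(Age = paste(head(bx, -1), bx[-1] - 1, sep = "-"), Count = counts)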

Summing depth data (consecutive rows) in R

How is it possible to sum up consecutive depth data with R?
For instance:
a <- data.frame(label = as.factor(c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood")),
                depth = as.numeric(c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)))
The desired output should be something like:
Label Depth
Air 7
Wood 3
Stone 1
First, the negative values are removed with cummax(), because depth can only increase in this special case.
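A minimal sketch of that step, assuming it is applied directly to the example data frame a defined above:
a$depth <- cummax(a$depth)  # replace every dip by the running maximum so far
a
which gives: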
label depth
1 Air 1
2 Air 2
3 Air 3
4 Air 3
5 Air 4
6 Air 5
7 Wood 5
8 Wood 5
9 Wood 5
10 Wood 6
11 Wood 8
12 Air 9
13 Air 9
14 Air 9
15 Air 10
16 Stone 10
17 Stone 10
18 Stone 11
19 Stone 11
20 Air 11
21 Air 12
22 Air 12
23 Air 12
24 Air 13
25 Wood 14
26 Wood 14
Now, by taking max minus min of the depth within every consecutive run, you would get the following (the question is how to do this step):
label depth
1 Air 4
2 Wood 3
3 Air 1
4 Stone 1
5 Air 2
6 Wood 0
And finally summing up those max-min values the output is the one presented above.
Steps tried to achieve the output:
The first obvious solution would be for instance for Air:
diff(cummax(a[a$label=="Air",]$depth))
This solution gets rid of the negative data, which is necessary due to an expected constant increase in depth.
The problem is the output also takes into account the big steps in between each consecutive subset. Hence, the sum for Air would be 12 instead of 7.
[1] 1 1 0 1 1 4 0 0 1 1 1 0 0 1
Even worse would be a solution with aggregate, e.g.:
aggregate(depth~label, a, FUN=function(x){sum(x>0)})
Note: solutions that filter out big jumps are not what I'm looking for. Sure, you could hard-code a limit, for instance <2, for the Air example once again:
sum(diff(cummax(a[a$label=="Air",]$depth))[diff(cummax(a[a$label=="Air",]$depth))<2])
This gives you almost the right result but does not work as expected here. I'm pretty sure there is already a function for what I'm looking for, because it is not an uncommon problem across many different tasks.
I guess taking the minimum and maximum value of each set of consecutive rows per material and summing those up would be one possible solution, but I'm not sure how to apply a function to only the consecutive subsets.
You can use data.table::rleid to quickly group by run, or reconstruct it with rle if you really like. After that, aggregating is fairly easy in any grammar. In dplyr,
library(dplyr)
a <- data.frame(label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood"),
                depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14))

a2 <- a %>%
    # filter to rows where the previous value is lower, equal, or NA
    filter(depth >= lag(depth) | is.na(lag(depth))) %>%
    # group by label and its run
    group_by(label, run = data.table::rleid(label)) %>%
    summarise(depth = max(depth) - min(depth))  # aggregate

a2 %>% arrange(run)  # sort to make it pretty
#> # A tibble: 6 x 3
#> # Groups: label [3]
#> label run depth
#> <fctr> <int> <dbl>
#> 1 Air 1 4
#> 2 Wood 2 3
#> 3 Air 3 1
#> 4 Stone 4 1
#> 5 Air 5 2
#> 6 Wood 6 0
a3 <- a2 %>% summarise(depth = sum(depth)) # a2 is still grouped, so aggregate more
a3
#> # A tibble: 3 x 2
#> label depth
#> <fctr> <dbl>
#> 1 Air 7
#> 2 Stone 1
#> 3 Wood 3
A base R method using aggregate is
aggregate(cbind(val = cummax(a$depth)),
          list(label = a$label, ID = c(0, cumsum(diff(as.integer(a$label)) != 0))),
          function(x) diff(range(x)))
The first argument to aggregate calculates the cumulative maximum of the input vector, as the OP does above; wrapping it in cbind names the calculated column in the final output. The second argument is the grouping argument. Instead of rle, it builds a run ID by taking the cumulative sum of the changes in label. Finally, the third argument provides the function that calculates the desired output by taking the difference of the range within each group.
This returns
label ID val
1 Air 0 4
2 Wood 1 3
3 Air 2 1
4 Stone 3 1
5 Air 4 2
6 Wood 5 0
The data.table way (borrowing in part from @alistaire):
library(data.table)  # provides setDT and rleidv

setDT(a)
a[, depth := cummax(depth)]
depth_gain <- a[,
    list(
        depth = max(depth) - depth[1],  # only need the starting and max values
        label = label[1]
    ),
    by = rleidv(label)
]
result <- depth_gain[, list(depth = sum(depth)), by = label]
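If you prefer plain base R, the rle reconstruction mentioned above could look roughly like this (a sketch, reusing the cummax() step):
a$depth <- cummax(a$depth)  # remove the dips, as before
r <- rle(as.character(a$label))  # r$values holds the label of each run
run <- rep(seq_along(r$lengths), r$lengths)  # run id for every row
gain <- tapply(a$depth, run, function(x) diff(range(x)))  # max - min within each run
tapply(gain, r$values, sum)
#   Air Stone  Wood
#     7     1     3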

Doing a series of operations on every subset of the data obtained from a dataframe

This is a question from a noob in the R world. I tried searching and there were quite a few solutions that came close (e.g. aggregate, by, etc.), but I lacked the understanding to apply them to my problem. I would really appreciate it if someone could guide me in a more detailed way.
Hypothetical Dataset
Name Wheels Color Mileage seat_capacity
1 2 Red 70 2
2 3 Black 60 7
3 4 Blue 12 5
4 4 White 15 6
5 3 Yellow 45 6
6 2 Green 70 2
7 3 Silver 45 6
8 6 Silver 5 4
9 14 Red 12 2
10 2 Black 70 7
11 4 Blue 70 5
12 3 White 60 6
13 4 Yellow 12 6
14 4 Green 15 2
I have initially created subsets of the data based on color using split:
color <- split(df, df$Color)
For each of the subsets created I would be doing more operations, e.g. finding the vehicles with the highest mileage among the vehicles with the lowest number of wheels in each subset, etc.
I have written all the rules pertaining to the latter half as well. I am struggling to find a way to run all the operations on each of the subsets in the variable color.
Any help would be appreciated.
The following worked for me and I would sincerely like to thank @Imo and @aosmith for guiding me.
Assume I first want to group df by colour, then group further by wheels, and then within each such subgroup (wheels) pick the top 2 vehicles based on Mileage. I used the dplyr library to achieve this:
my_list <- df %>% group_by(Color, Wheels) %>% top_n(2, Mileage)
HTH
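If you would rather keep working with the list produced by split() as in the question, the same kind of rule can be applied to every subset with lapply. A sketch for the rule mentioned above (highest mileage among the vehicles with the fewest wheels per colour), assuming the column names from the hypothetical dataset:
by_colour <- split(df, df$Color)
result <- lapply(by_colour, function(d) {
    fewest <- d[d$Wheels == min(d$Wheels), ]  # vehicles with the fewest wheels in this subset
    fewest[which.max(fewest$Mileage), ]  # highest mileage among them
})
do.call(rbind, result)  # bind the per-colour rows back into one data frame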

Remove lines in an efficient way

I have a data frame named df:
number value
1 5
2 5
3 5
4 6
5 6
6 6
7 6
8 7
9 7
10 7
11 7
12 7
13 8
14 9
15 9
I want to remove specific rows based on a min and a max level. I tried this separately:
df[df$value>5 , ]
and after that this:
df[df$value>8 , ]
After that I tried this:
df[df$value>5 & df$value>8, ]
but it effectively applies only df$value>8.
Another problem I observed is that when I type
df[df$value>5, ]
it removes the rows in the printed output, but when I type df again it still contains the values I tried to remove. What could be wrong, and how do I get a data frame without the removed values?
An example of the desired output:
number value
4 6
5 6
6 6
7 6
8 7
9 7
10 7
11 7
12 7
If you want to remove lines with a level lower than the min and higher than the max, try this:
df[df$value<5 | df$value>8, ]
Edit
Here is the right code:
df <- df[df$value > 5 & df$value < 8, ]
It works for me.
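Note that subsetting never changes df itself; you have to assign the result back, as in the edit, or to a new object. An equivalent sketch with subset(), assuming the min and max levels are 5 and 8:
df_between <- subset(df, value > 5 & value < 8)  # keep only the rows strictly between the levels
df_between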

Frequency distribution with custom format data

I need help with an R plot, with a data format I have not worked with before. Please help if you know how.
NUMBER FREQUENCY
10 1
11 1
12 3
10 45
11 2
12 3
I need a bar plot with the numbers on the X axis (continuous, not histogram bins) and the frequency on the Y axis, but combined, like:
10 46
11 3
12 6
It seems simple enough, but I have 10,000 rows and large numbers in the real data, so I am looking for a good solution in R without doing it manually.
What about:
## tapply splits dd$FREQUENCY by dd$NUMBER and sums each group
barplot(tapply(dd$FREQUENCY, dd$NUMBER, sum))
to get a bar plot of the combined frequencies.
Read in your data:
dd = read.table(textConnection("NUMBER FREQUENCY
10 1
11 1
12 3
10 45
11 2
12 3"), header=TRUE)
