I have to modify an existing variable in a data frame based on a condition.
df <- data.frame(a = c('upperGI','UpperGI','UpperGI','UpperGI'),
                 b = c('C22.0 - Liver cell carcinoma', 'C16.0 - Cardia', 'C15.3 - Upper third of oesophagus', 'C25.9 - Pancreas, unspecified'))
I would like to split the variable a into upperGI and HPB when b lies between 22.0 and 25.0.
So the first one and second should be upperGI and third and fourth would be HPB.
I am trying to learn the dplyr package, so it would be great to see a dplyr solution (if possible).
You can try the following:
library(dplyr)
df_new <- df %>%
  mutate(num = readr::parse_number(b),
         col = if_else(between(num, 22, 25), 'upperGI', 'HPB', missing = 'upperGI'))
We use parse_number to extract the numeric part of the b column into num, and then assign 'upperGI' when num is between 22 and 25 and 'HPB' otherwise.
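For the sample data above, this should produce something like (note that, by this rule, only the first code falls in the 22-25 range):
df_new
#         a                                 b  num     col
# 1 upperGI      C22.0 - Liver cell carcinoma 22.0 upperGI
# 2 UpperGI                    C16.0 - Cardia 16.0     HPB
# 3 UpperGI C15.3 - Upper third of oesophagus 15.3     HPB
# 4 UpperGI     C25.9 - Pancreas, unspecified 25.9     HPB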
I would like to split the variable a into upperGI and HPB when b lies between 22.0 and 25.0.
So the first one and second should be upperGI and third and fourth would be HPB.
The latter seems a bit odd given the data set you provide (e.g. the first value, 22.0, satisfies 22 <= 22.0 <= 25, whilst the second, 26.3, is greater than 25.0). Like Ronak Shah, I assume that you mean that values between 22 and 25 should be "upperGI". In that case, an alternative but similar base R solution to Ronak's is:
transform(df, new_col = c('HPB', 'upperGI')[(b >= 22 & b <= 25) + 1L])
#R> a b new_col
#R> 1 upperGI 22.0 upperGI
#R> 2 UpperGI 26.3 HPB
#R> 3 UpperGi 21.0 HPB
#R> 4 UpperGI 25.0 upperGI
It adds a transform call but saves you three df$s.
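For comparison, the version spelled out with the three df$s would be:
df$new_col <- c('HPB', 'upperGI')[(df$b >= 22 & df$b <= 25) + 1L]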
Data
df <- data.frame(a = c('upperGI','UpperGI','UpperGi','UpperGI'),
b = c(22.0, 26.3, 21.0, 25.0))
There is also a package called fmtr specifically designed for this situation. You can define a format using the value() function, and apply it using the fapply() function. Like this:
library(fmtr)
# Set up data
df <- data.frame(a = c('upperGI','UpperGI','UpperGI','UpperGI', 'UpperGI'),
b = c('C22.0 - Liver cell carcinoma',
'C16.0 - Cardia',
'C15.3 - Upper third of oesophagus',
'C25.9 - Pancreas, unspecified',
'C23.0 - Liver, pancreas, and biliary surgery'))
df$c <- as.numeric(substr(df$b, 2, 5))
# Define format
fmt <- value(condition(x > 22 && x <= 25, "HPB"),
condition(TRUE, "upperGI"))
# Apply format
df$a <- fapply(df$c, fmt)
df
# a b c
#1 upperGI C22.0 - Liver cell carcinoma 22.0
#2 upperGI C16.0 - Cardia 16.0
#3 upperGI C15.3 - Upper third of oesophagus 15.3
#4 upperGI C25.9 - Pancreas, unspecified 25.9
#5     HPB  C23.0 - Liver, pancreas, and biliary surgery 23.0
Related
In R, I perform Dunn's test. The function I use has no option to group the input variables by their statistically significant differences. However, this is what I am genuinely interested in, so I tried to write my own function. Unfortunately, I am not able to wrap my head around it. Perhaps someone can help.
I use the airquality dataset that comes with R as an example. The result that I need could look somewhat like this:
> library(tidyverse)
> ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
# A tibble: 5 x 2
Month Mean
<int> <dbl>
1 5 23.6
2 6 29.4
3 7 59.1
4 8 60.0
5 9 31.4
When I run the dunn.test, I get the following:
> dunn.test::dunn.test(airquality$Ozone, airquality$Month, method = "bh", altp = T)
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 0
Comparison of x by group
(Benjamini-Hochberg)
Col Mean-|
Row Mean | 5 6 7 8
---------+--------------------------------------------
6 | -0.925158
| 0.4436
|
7 | -4.419470 -2.244208
| 0.0001* 0.0496*
|
8 | -4.132813 -2.038635 0.286657
| 0.0002* 0.0691 0.8604
|
9 | -1.321202 0.002538 3.217199 2.922827
| 0.2663 0.9980 0.0043* 0.0087*
alpha = 0.05
Reject Ho if p <= alpha
From this result, I deduce that May differs from July and August, June differs from July (but not from August) and so on. So I'd like to append significantly differing groups to my results table:
# A tibble: 5 x 3
Month Mean Group
<int> <dbl> <chr>
1 5 23.6 a
2 6 29.4 ac
3 7 59.1 b
4 8 60.0 bc
5 9 31.4 a
While I did this by hand, I suppose it must be possible to automate this process. However, I don't find a good starting point. I created a dataframe containing all comparisons:
> ozone_differences <- dunn.test::dunn.test(airquality$Ozone, airquality$Month, method = "bh", altp = T)
> ozone_differences <- data.frame ("P" = ozone_differences$altP.adjusted, "Compare" = ozone_differences$comparisons)
P Compare
1 4.436043e-01 5 - 6
2 9.894296e-05 5 - 7
3 4.963804e-02 6 - 7
4 1.791748e-04 5 - 8
5 6.914403e-02 6 - 8
6 8.604164e-01 7 - 8
7 2.663342e-01 5 - 9
8 9.979745e-01 6 - 9
9 4.314957e-03 7 - 9
10 8.671708e-03 8 - 9
I thought that a function iterating through this data frame and using a selection variable to choose the right letter from the built-in letters vector might work. However, I cannot even think of a starting point, because changing numbers of rows have to be considered at the same time...
Perhaps someone has a good idea?
Perhaps you could look into the cldList() function from the rcompanion library; you can pipe the res element of the output of dunnTest() into it and create a table that specifies the compact letter display comparison per group.
Following the advice of @TylerRuddenfort, the following code will work. The first cld is created with rcompanion::cldList, and the second directly uses multcompView::multcompLetters. Note that to use multcompLetters, the spaces have to be removed from the names of the comparisons.
Here, I have used FSA::dunnTest for the Dunn test (1964).
In general, I recommend ordering groups by e.g. median or mean before running e.g. dunnTest if you plan on using a cld, so that the cld comes out in a sensible order.
library(tidyverse)
ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
library(FSA)
Result = dunnTest(airquality$Ozone, airquality$Month, method = "bh")$res
### Use cldList()
library(rcompanion)
cldList(P.adj ~ Comparison, data=Result)
### Use multcompView
library(multcompView)
X = Result$P.adj <= 0.05
names(X) = gsub(" ", "", Result$Comparison)
multcompLetters(X)
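To append the letters to your summary table (the Group column you sketched above), one possible sketch, assuming cldList()'s output holds the group names in Group and the display letters in Letter:
cld <- cldList(P.adj ~ Comparison, data = Result)
# match the month numbers against the (character) group labels from the cld
ozone_summary$Group <- cld$Letter[match(ozone_summary$Month, cld$Group)]
ozone_summary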
I am creating a data frame (hoops) with three columns (t, x, y) and 700 rows; see the code at the bottom. In the first row, I have set column t to equal 0. From the second row on, I want column t to be calculated by taking the previous row's t value and adding a constant (hoops_var). I want this formula to continue to row 700.
hoops<-data.frame(t=double(),x=double(),y=double())
hoops_var<- 1.5
hoops[1,1]<- 0
hoops[1,2]<- (hoops$t+23)
hoops[1,3]<- (hoops$t/2)
# What I want for row 2
hoops[2,1]<- hoops[[1,1]]+hoops_var #this formula for rows 2 to 700
hoops[2,2]<- (hoops$t+23) #same as row 1
hoops[2,3]<- (hoops$t/2) #same as row 1
# What I want for row 3 to 700 (same as row 2)
hoops[3:700,1]<- hoops[[2,1]]+hoops_var #same as row 2
hoops[3:700,2]<- (hoops$t+23) #same as rows 1 & 2
hoops[3:700,3]<- (hoops$t/2) #same as row 1 & 2
The first four rows of the table should look like this:
#    t    x    y
#1 0.0 23.0 0.00
#2 1.5 24.5 0.75
#3 3.0 26.0 1.50
#4 4.5 27.5 2.25
The only applicable solution I found (linked at bottom) did not work for me.
I am fairly new to R, so apologies if this is a dumb question. Thanks in advance for any help.
R: Creating a new row based on previous rows
You should use vectorized operations:
# first create all columns as vectors
hoops_t <- hoops_var*(0:699) #0:699 gives a vector of 700 consecutive integers
hoops_x <- hoops_t+23
hoops_y <- hoops_t/2
# now we are ready to put all vectors in a dataframe
hoops <- data.frame(t=hoops_t,x=hoops_x,y=hoops_y)
Now, if you want to change the t column, you can use lag from dplyr to shift all values. For example:
library(dplyr)
hoops$t[2:nrow(hoops)] <- lag(hoops$x*hoops$y)[2:nrow(hoops)]
I select only [2:nrow(hoops)] (all rows except the first one) because you don't want the first row to be modified.
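If you prefer to keep the "previous row plus a constant" framing from the question, cumsum expresses that recurrence directly (a sketch equivalent to the vectorized version above):
hoops_t <- cumsum(c(0, rep(hoops_var, 699))) # t[1] = 0, then t[i] = t[i-1] + hoops_var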
You could use the following:
n <- 10 #Number of rows in the data.frame
t <- seq(0, by = 1.5, length.out = n)
x <- 23 + t
y <- t/2
hoops <- data.frame(t, x, y)
hoops #Sample for 10 rows.
# t x y
#1 0.0 23.0 0.00
#2 1.5 24.5 0.75
#3 3.0 26.0 1.50
#4 4.5 27.5 2.25
#5 6.0 29.0 3.00
#6 7.5 30.5 3.75
#7 9.0 32.0 4.50
#8 10.5 33.5 5.25
#9 12.0 35.0 6.00
#10 13.5 36.5 6.75
Suppose I have a dataframe named score.master that looks like this:
school perc.prof num.tested
A 8 482
B 6-9 34
C 40-49 49
D GE50 81
E 80-89 26
Here, school A's percent proficient is 8%, and the number of students tested is 482. However, suppose that when num.tested falls below a certain number (in this case arbitrarily 100), data suppression is introduced. In most cases, ranges of perc.prof are given but in other cases a value such as "GE50" is given, indicating greater than or equal to 50.
My question is, in a much larger dataset, what is the best way to replace a range with its median? So for example I want the final dataset to look like this:
school perc.prof num.tested
A 8 482
B 8 34
C 44 49
D 75 81
E 85 26
I know this can be done manually like this:
score.master$perc.prof[score.master$perc.prof == "6-9"] <- round(median(6:9), 0)
But the actual dataset has many more range combinations. One way I thought of selecting the correct values is by length; all provided values are 1-2 characters long (no more than 99 percent proficient) whereas the range values are 3 or more characters long.
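For example, something along these lines (a sketch) could flag the suppressed values by length:
is.range <- nchar(as.character(score.master$perc.prof)) >= 3
score.master[is.range, ]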
You can use stringr::str_split() to get the lower and upper bounds, then calculate the median. "GE50" and similar values don't generalize to this approach, but you could use ifelse() to handle those special cases.
df <- data.frame(perc.prof = c('8', '6-9', '40-49', 'GE50', '80-89'))
df$lower.upper <- sapply(stringr::str_split(df$perc.prof, '-'), as.integer)
df$perc.prof.median <- sapply(df$lower.upper, median)
df$lower.upper <- NULL
> df
perc.prof perc.prof.median
1 8 8.0
2 6-9 7.5
3 40-49 44.5
4 GE50 NA
5 80-89 84.5
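The ifelse() handling of the special case could be sketched like so (assuming "GE50" should map to the midpoint of 50-100, i.e. 75):
df$perc.prof.median <- ifelse(df$perc.prof == "GE50",
                              median(50:100), # assumption: GE50 spans 50-100
                              df$perc.prof.median)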
You could do the following to replace your ranges with the median. However, I did not handle the "GExx" or "LExx" situations, since they are not well defined enough.
Note that you would need the stringr package for my solution.
score.master$perc.prof <- sapply(score.master$perc.prof, function(x){
sep <- stringr::str_locate(x, "-")[, 1]
if(is.na(sep)) {
x
} else {
as.character(round(median(as.integer(stringr::str_sub(x, c(1L, sep+1), c(sep-1, -1L))))))
}
})
Here's a tidyverse approach. First I replace "GE50" with its expected output, then use tidyr::separate to split perc.prof where possible. The last step either uses the given perc.prof (for large schools) or the median of the range (for small schools).
library(tidyverse)
df %>%
  mutate(perc.prof = if_else(perc.prof == "GE50", "75", perc.prof)) %>%
  separate(perc.prof, c("low", "high"), remove = F, convert = T, fill = "right") %>%
  mutate(perc.prof.adj = if_else(num.tested > 100,
                                 as.numeric(perc.prof),
                                 (low + coalesce(high, low)) / 2)
  )
  school perc.prof low high num.tested perc.prof.adj
1      A         8   8   NA        482           8.0
2      B       6-9   6    9         34           7.5
3      C     40-49  40   49         49          44.5
4      D        75  75   NA         81          75.0
5      E     80-89  80   89         26          84.5
I have a data frame in R with two columns, temp and timeStamp. The temp values are recorded at regular intervals, and the temp value often remains the same across several consecutive timeStamps. I have to create a line chart showing changes in temp over time. Having these repeating values increases the size of the data file, so I want to remove them and keep just the rows where the value changes.
I cannot think of a way to get this done in R. Any inputs in the right direction would be really helpful.
Here's a dplyr solution:
library(dplyr)
# Toy data
df <- data.frame(time = seq(20), temp = c(rep(60, 5), rep(61, 7), rep(59, 3), rep(60, 5)))
# Now filter for the first and last rows and ones bracketing a temperature change
df %>% filter(temp!=lag(temp) | temp!=lead(temp) | time==min(time) | time==max(time))
time temp
1 1 60
2 5 60
3 6 61
4 12 61
5 13 59
6 15 59
7 16 60
8 20 60
If the data are grouped by a third column (id), just add group_by(id) %>% before the filtering step.
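For example (a sketch, assuming the grouping column is called id):
df %>%
  group_by(id) %>%
  filter(temp != lag(temp) | temp != lead(temp) | time == min(time) | time == max(time))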
One option would be using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'temp', we subset the first and last observation (.SD[c(1L, .N)]) of each group. If a group has only a single row, we take it as is (else .SD).
library(data.table)
setDT(df1)[, if(.N>1) .SD[c(1L, .N)] else .SD, by =temp]
# temp val
#1: 22.50 1
#2: 22.50 4
#3: 22.37 5
#4: 22.42 6
#5: 22.42 7
Or a base R option with duplicated. We check the duplicated values in 'temp' (output is a logical vector), and also check the duplication from the reverse side (fromLast=TRUE). Use & to find the elements that are TRUE in both cases, negate (!) and subset the rows of 'df1'.
df1[!(duplicated(df1$temp) & duplicated(df1$temp,fromLast=TRUE)),]
# temp val
#1 22.50 1
#4 22.50 4
#5 22.37 5
#6 22.42 6
#7 22.42 7
data
df1 <- data.frame(temp=c(22.5, 22.5, 22.5, 22.5, 22.37,22.42, 22.42), val=1:7)
I have a dataframe df1:
df1 <- data.frame(
Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B"),
Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19"),
Step = c("A","A","B","B","C","C","C"),
kg = c(31,32,14,16,10,11,10))
Sometimes at a particular 'Step' a 'Lot' gets split into A, B, or C as indicated. I'd like to sum those and get a data frame that tells me the total kg at each step, for each lot.
For example the output should look like this:
df2 <- data.frame(
Lot = c("13VC011","13VC011","13VC018","13VC018"),
Step = c("A","B","A","C"),
kg = c(31,30,32,31))
So there are two requirements: if the 'Lot' matches (regardless of the trailing letter) and the 'Step' matches, then the kg values are summed. If the two conditions are not both satisfied, the line item is carried over into df2 as-is.
Part 2:
So I would like to introduce a third requirement. In some cases, the Lot was split into two or three parts, but not all of the data is present. Using the solutions above then masks this and makes it appear that a lot has much lower kg than it actually has.
What I would like to do is find a way to indicate if the dataset contains 13VC011A for example, but no 13VC011B. Or if we see a 'B' but no 'A' or a 'C' but no 'B' or 'A'.
So now the original dataframe is:
df1 <- data.frame(
Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B","13VC020B"),
Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19","2013-07-22"),
Step = c("A","A","B","B","C","C","C","B"),
kg = c(31,32,14,16,10,11,10,18))
And the resultant df2 should look something like:
df2 <- data.frame(
Lot = c("13VC011","13VC011","13VC018","13VC018","13VC020B"),
Step = c("A","B","A","C","B"),
kg = c(31,30,32,31,18),
Partial = c(F,F,F,F,T))
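Neither answer below addresses the Partial column; one possible sketch (assuming split lots are always labelled with consecutive letters starting at A, so a lone "B", or a "C" without a "B", marks missing data) would be:
library(dplyr)
library(stringr)

df1 %>%
  mutate(base = str_remove(Lot, "[A-Z]$"),
         piece = str_extract(Lot, "[A-Z]$")) %>%
  group_by(Lot = base, Step) %>%
  summarize(kg = sum(kg),
            # flag groups whose lettered pieces don't form an unbroken A, B, ... run
            Partial = any(!is.na(piece)) &&
              !identical(sort(piece), LETTERS[seq_along(piece[!is.na(piece)])]),
            .groups = "drop")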
df1$Lot <- gsub("[[:alpha:]]$","",df1$Lot) # strip the single trailing letter, if any, from each Lot
aggregate(kg~Lot+Step,df1, FUN=sum)
# Lot Step kg
#1 13VC011 A 31
#2 13VC011 B 30
#3 13VC018 A 32
#4 13VC018 C 31
Or using dplyr
library(stringr)
library(dplyr)
df1%>%
group_by(Lot=str_extract(Lot,'.*\\d(?=[A-Z]?$)'), Step) %>%
summarize(kg=sum(kg))
#Source: local data frame [4 x 3]
#Groups: Lot
# Lot Step kg
#1 13VC011 A 31
#2 13VC011 B 30
#3 13VC018 A 32
#4 13VC018 C 31
Explanation of the regex:
.* : match any sequence of characters
\\d : ending in a digit
(?=[A-Z]?$) : with a lookahead requiring that only an optional uppercase letter remains before the end ($) of the string
> aggregate(kg ~Lot + Step, data=df1, FUN=sum)
Lot Step kg
1 13VC011 A 31
2 13VC018 A 32
3 13VC011A B 14
4 13VC011B B 16
5 13VC018A C 10
6 13VC018B C 10
7 13VC018C C 11
At that point I finally understood what you meant by "regardless of the trailing letter" and wondered if the formula method of aggregate could handle an R-function in one of the terms:
> aggregate(kg ~substr(Lot,1,7) + Step, data=df1, FUN=sum)
substr(Lot, 1, 7) Step kg
1 13VC011 A 31
2 13VC018 A 32
3 13VC011 B 30
4 13VC018 C 31
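And if the lot names were not all seven characters wide, the same formula trick should work with the trailing-letter strip used earlier in the thread (a sketch):
aggregate(kg ~ sub("[A-Z]$", "", Lot) + Step, data = df1, FUN = sum)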