Suppose I have a dataframe named score.master that looks like this:
school perc.prof num.tested
A      8         482
B      6-9       34
C      40-49     49
D      GE50      81
E      80-89     26
Here, school A's percent proficient is 8%, and the number of students tested is 482. However, suppose that when num.tested falls below a certain threshold (here, arbitrarily, 100), data suppression kicks in. In most cases a range of perc.prof is given, but in others a code such as "GE50" appears, indicating greater than or equal to 50.
My question is, in a much larger dataset, what is the best way to replace a range with its median? So for example I want the final dataset to look like this:
school perc.prof num.tested
A      8         482
B      8         34
C      44        49
D      75        81
E      85        26
I know this can be done manually like this:
score.master$perc.prof[score.master$perc.prof == "6-9"] <- round(median(6:9), 0)
But the actual dataset has many more range combinations. One way I thought of identifying the suppressed values is by length: all exact values are 1-2 characters long (percent proficient never exceeds 99), whereas the range values are 3 or more characters long.
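As a sketch, that length test could look like this (assuming perc.prof is stored as character):
# Flag suppressed entries: ranges and codes like "GE50" are 3+ characters long
is.suppressed <- nchar(as.character(score.master$perc.prof)) >= 3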
You can use stringr::str_split() to get the lower and upper bounds, then calculate the median. Values like "GE50" don't fit this pattern, so you could handle them as special cases with ifelse().
df <- data.frame(perc.prof = c('8', '6-9', '40-49', 'GE50', '80-89'))
# split on '-' and convert the pieces to integers; non-numeric values
# like 'GE50' become NA (with a coercion warning)
df$lower.upper <- lapply(stringr::str_split(df$perc.prof, '-'), as.integer)
# median of each lower/upper pair; a single value is its own median
df$perc.prof.median <- sapply(df$lower.upper, median)
df$lower.upper <- NULL
> df
perc.prof perc.prof.median
1 8 8.0
2 6-9 7.5
3 40-49 44.5
4 GE50 NA
5 80-89 84.5
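For the special cases, a possible follow-up with ifelse() (assuming, as in the question, that "GE50" stands for the range 50-100):
# Assumed mapping: "GE50" means 50 to 100, so its median is 75
df$perc.prof.median <- ifelse(df$perc.prof == "GE50",
                              median(50:100),
                              df$perc.prof.median)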
You could do the following to replace your ranges with their medians. However, I did not handle the "GExx" or "LExx" cases, since they aren't well defined enough.
Note that my solution requires the stringr package.
score.master$perc.prof <- sapply(score.master$perc.prof, function(x) {
  # find the position of the dash, if any
  sep <- stringr::str_locate(x, "-")[, 1]
  if (is.na(sep)) {
    # no dash: keep the exact value as-is
    x
  } else {
    # extract the lower and upper bounds, then take the rounded median
    as.character(round(median(as.integer(
      stringr::str_sub(x, c(1L, sep + 1), c(sep - 1, -1L))
    ))))
  }
})
Here's a tidyverse approach. First I replace "GE50" with an explicit range, then use tidyr::separate to split perc.prof where possible. The last step either uses the given perc.prof for large schools, or the midpoint of the range for small schools.
library(tidyverse)
df %>%
  # treat "GE50" as the explicit range 50-100 so it splits like the others
  mutate(perc.prof = if_else(perc.prof == "GE50", "50-100", perc.prof)) %>%
  separate(perc.prof, c("low", "high"), remove = FALSE, convert = TRUE) %>%
  mutate(perc.prof.adj = if_else(num.tested > 100,
                                 as.numeric(perc.prof),
                                 rowSums(select(., low, high), na.rm = TRUE) / 2))
  school perc.prof low high num.tested perc.prof.adj
1      A         8   8   NA        482           8.0
2      B       6-9   6    9         34           7.5
3      C     40-49  40   49         49          44.5
4      D    50-100  50  100         81          75.0
5      E     80-89  80   89         26          84.5
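If the real data contains several such open-ended codes, a case_when() mapping keeps that first step readable (the "LE10" code here is hypothetical, purely for illustration):
df <- df %>%
  mutate(perc.prof = case_when(
    perc.prof == "GE50" ~ "50-100",  # greater than or equal to 50
    perc.prof == "LE10" ~ "0-10",    # hypothetical "less than or equal to" code
    TRUE ~ perc.prof
  ))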
Related
I have 1000+ rows and I want to calculate the CV for each group of rows that shares the same condition.
The data look like this:
Condition Y
0.5 25
0.5 26
0.5 27
1 43
1 45
1 75
5 210
5 124
5 20
10 54
10 78
10 10
and then I did:
CV <- function(x){
(sd(x)/mean(x))*100
}
CV.per.condition <- aggregate(Y ~ Condition,
                              data = df,
                              FUN = CV)
I have the feeling that what I did uses the mean of the whole column rather than the per-condition mean, because the results look off.
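For reference, aggregate() with a formula does compute the statistic within each Condition group; an equivalent dplyr sketch (assuming the data frame is named df) makes the per-group result easy to eyeball:
library(dplyr)
df %>%
  group_by(Condition) %>%
  summarise(CV = sd(Y) / mean(Y) * 100)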
I have a dataframe (sample of the following form):
DateTime Ind1 Ind2 V1 V2 Ac1 Ac2 w1 w2 w3 shift
2016-05-01 00:01:00 U A 5 7 20 100 50 70 200 1
2016-05-01 00:01:20 U A 5 7 20 109 35 77 140 1
2016-05-01 00:01:40 U A 5 7 40 120 55 97 160 1
...
2016-05-01 00:08:20 U A 5 7 15 157 70 70 204 2
...
2016-05-02 00:08:20 U A 5 7 28 147 65 90 240 2
...
2016-05-02 00:20:00 U A 5 7 35 210 45 100 167 3
I need a new dataframe where some statistics (e.g. mean, standard deviation) for the columns V1 to w3 are listed for each date-and-shift combination, something similar to the following:
Date shift Ind1 Ind2 avgV1 sdV1 avgV2 sdV2 avgAC1 ....
2016-05-01 1 U A 5.3 2.9 7.8 4.5 108 .....
2016-05-01 2 U A 6.7 3.5 8.9 5.0 99 .....
SOLUTION TRIED:
I can do the following steps.
1) extract date from DateTime
df$Date <- format(as.POSIXct(df$DateTime, format="%Y-%m-%d %H:%M:%S"), format="%Y-%m-%d")
2) label the data by date and shift.
df$DateShift <- paste(df$Date, df$shift)
3) for each subset, calculate some statistics on a col:
tmp_df <- data.frame(levels(as.factor(df$DateShift)))
avgV1 <- tapply(df$V1, df$DateShift, FUN=mean)
sdV1 <- tapply(df$V1, df$DateShift, FUN=sd)
avgV2<- tapply(df$V2, df$DateShift, FUN=mean)
....
However, I have more than 50 columns in the original dataframe, with different types of names (not as simple as in the example above).
Moreover, the statistics that I want to compute may vary (say, calculation of max and min, or some other user-defined function).
So I don't want to code by hand all the different combinations of columns and types of statistic (mean, standard deviation, etc.).
What is the way to automate this?
I am sure the dplyr solutions are coming, but the doBy package works very well for this kind of thing, unless you have many (millions+) rows, in which case it will be slow.
library(doBy)
df_avg <- summaryBy(. ~ Date + shift, FUN = c(mean, median, sd), data = df, na.rm = TRUE)
Will give a dataframe with V1.mean, V1.median, and so on.
The . ~ on the left-hand side means "summarize all numeric variables". If you want to keep information from some factors in the dataframe, use the id argument, for example id = ~somefac + somefac2.
library(dplyr)
df %>%
  mutate(Date = as.Date(DateTime)) %>%
  group_by(Date, shift) %>%
  summarise_each(funs(mean, sd), V1:w3)
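On current dplyr (1.0+), summarise_each() and funs() are deprecated; a rough equivalent with across() would be:
df %>%
  mutate(Date = as.Date(DateTime)) %>%
  group_by(Date, shift) %>%
  summarise(across(V1:w3, list(mean = mean, sd = sd)))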
I have a dataset Comorbidity in RStudio, where I have added columns such as MDDOnset: it equals 1 if the age at onset of MDD is less than the age at onset of OUD, and 2 if the opposite is true. I also have another column, PhysDis, with numeric values from 0 to 100.
What I want to do is make a new column that includes the values of PhysDis, but only where MDDOnset == 1, and another for MDDOnset == 2. I want to make these columns so that I can run a t-test comparing the two groups (those with MDD prior to OUD and those with MDD after OUD) on which has the greater physical disability score. I want any case where MDDOnset is not 1 to be NA.
ttest1 <-t.test(Comorbidity$MDDOnset==1, Comorbidity$PhysDis)
ttest2 <-t.test(Comorbidity$MDDOnset==2, Comorbidity$PhysDis)
When I ran the two t-tests, once with MDDOnset == 1 and once with MDDOnset == 2, the mean for y (Comorbidity$PhysDis) was the same in both. Looking at the original csv file, that mean turned out to be the mean of the entire column, not just of the cases where MDDOnset was 1 or 2. If there is a way to run the t-tests so that the mean of PhysDis is taken only over cases where MDDOnset == 1, and separately only over cases where MDDOnset == 2, without making new columns, please tell me. Sorry if there are similar questions or if my approach is way off; I'm new to R and programming in general. Thanks in advance.
Here's a smaller data frame where I tried to replicate the error of the new columns having switched lengths. If I could replicate it, the length of C would be 4 and the length of D would be 6.
> A <- sample(1:10)
> B <-c(25,34,14,76,56,34,23,12,89,56)
> alphabet <-data.frame(A,B)
> alphabet$C <-ifelse(alphabet$A<7, alphabet$B, NA)
> alphabet$D <-ifelse(alphabet$A>6, alphabet$B, NA)
> print(alphabet)
A B C D
1 7 25 NA 25
2 9 34 NA 34
3 4 14 14 NA
4 2 76 76 NA
5 5 56 56 NA
6 10 34 NA 34
7 8 23 NA 23
8 6 12 12 NA
9 1 89 89 NA
10 3 56 56 NA
> length(which(alphabet$C>0))
[1] 6
> length(which(alphabet$D>0))
[1] 4
I would use the mutate command from the dplyr package.
Comorbidity <- mutate(Comorbidity,
                      newColumn = ifelse(MDDOnset == 1, PhysDis, NA),
                      newColumn2 = ifelse(MDDOnset == 2, PhysDis, NA))
Using NA rather than "" keeps the new columns numeric (filling with "" would silently coerce them to character), and it matches your requirement that non-matching cases be NA.
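If you'd rather skip the new columns entirely, the formula interface of t.test compares the two groups directly; a minimal sketch, assuming MDDOnset takes only the values 1 and 2 among the rows of interest:
# Compare mean PhysDis between the MDDOnset == 1 and MDDOnset == 2 groups
t.test(PhysDis ~ MDDOnset, data = subset(Comorbidity, MDDOnset %in% c(1, 2)))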
I have two data frames. The first one contains the original state of an image with all the data available to reconstruct the image from scratch (the entire coordinate set and their color values).
I then have a second data frame. This one is smaller and contains only data about the differences (the changes made) between the updated state and the original state, sort of like video encoding with key frames.
Unfortunately I don't have a unique id column to help me match them. I have an x column and a y column which, combined, can make up a unique id.
My question is this: what is an elegant way of merging these two data sets, replacing the values in the original dataframe with the values in the "differenced" data frame whose x and y coordinates match?
Here's some example data to illustrate:
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
x y value
1 1 23 120
2 2 24 121
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 127
9 9 31 128
10 10 32 129
And the dataframe with updated differences:
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)
x y value
1 1 2 50
2 2 24 51
3 3 17 52
4 4 23 53
5 8 30 54
The desired final output should contain all the rows in the original data frame. However, the rows in original where the x and y coordinates both match the corresponding coordinates in update, should have their value replaced with the values in the update data frame. Here's the desired output:
original_updated <- data.frame(x = 1:10, y = 23:32,
value = c(120, 51, 122:126, 54, 128:129))
x y value
1 1 23 120
2 2 24 51
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 54
9 9 31 128
10 10 32 129
I've tried to come up with a vectorised solution with indexing for some time, but I can't figure it out. Usually I'd use %in% if it were just one column of unique ids, but here neither column is unique on its own.
One solution would be to treat the coordinates as strings or tuples, combining them into a single coordinate-pair column, and then use %in% or match (sketched below).
But I was curious whether there is any solution to this problem involving indexing with boolean vectors. Any suggestions?
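A minimal sketch of that paste-key idea (assuming each x/y pair is unique within a data frame):
# Build composite keys from the coordinate pairs
key.orig <- paste(original$x, original$y)
key.upd <- paste(update$x, update$y)
# For each original row, find the matching update row (NA if none)
idx <- match(key.orig, key.upd)
original$value <- ifelse(is.na(idx), original$value, update$value[idx])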
First merge in a way which guarantees all values from the original will be present:
merged = merge(original, update, by = c("x","y"), all.x = TRUE)
Then use dplyr to choose update's values where possible, and original's value otherwise:
library(dplyr)
middle = mutate(merged, value = ifelse(is.na(value.y), value.x, value.y))
final = select(middle, x, y, value)
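dplyr::coalesce() expresses the same NA-filling a bit more directly, picking the update value where one exists and falling back to the original otherwise:
middle = mutate(merged, value = coalesce(value.y, value.x))
final = select(middle, x, y, value)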
The match function is used to generate indices. It needs a nomatch argument to prevent NA appearing on the left-hand side of the data.frame assignment. I don't think it is as transparent as a merge followed by a replace, but I'm guessing it will be faster:
original[ match(update$x, original$x)[
            match(update$x, original$x, nomatch = 0) ==
            match(update$y, original$y, nomatch = 0) ],
          "value"] <-
  update[ which(match(update$x, original$x) == match(update$y, original$y)),
          "value"]
You can see the difference:
> match(update$x, original$x)[
match(update$x, original$x) ==
match(update$y, original$y) ]
[1] NA 2 NA 8
> match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)]
[1] 2 8
The "interior" match functions are returning:
> match(update$y, original$y)
[1] NA 2 NA 1 8
> match(update$x, original$x)
[1] 1 2 3 4 8
I am relatively new to R, coming from Stata. I have a data frame with 100+ columns and thousands of rows. Each row has a start value, a stop value, and 100+ columns of numerical values. The goal is to get, for each row, the sum from the column that corresponds to the start value to the column that corresponds to the stop value. This is straightforward to do in a loop, which looks like this (the data.frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
  df$out[i] <- rowSums(df[i, df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
# take[j, i] is TRUE when column position j+2 falls inside row i's start:end range
take <- outer(seq_len(ncol(DF) - 2) + 2, DF$start - 1, ">") &
        outer(seq_len(ncol(DF) - 2) + 2, DF$end + 1, "<")
# row i of the data matrix times column i of take sums exactly the selected columns
diag(as.matrix(DF[, -(1:2)]) %*% take)
#[1] 7 19 31 43 55
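A note on the memory caveat: diag(A %*% B) materializes the full n-by-n product only to read its diagonal. Since diag(A %*% B) equals rowSums(A * t(B)), the elementwise form avoids that:
rowSums(as.matrix(DF[, -(1:2)]) * t(take))
#[1] 7 19 31 43 55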
If you are dealing with values that are all of the same type, you typically want to work in matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=T)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=T))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that you are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that: you generate the whole answer (the $out column) at once and cbind it back to your data frame, so you modify the data frame just once.