This is my first project in R, having just learned Java.
I have a (large) data set that I have imported from a CSV file into a data frame.
I have identified the two relevant columns for this question: the first holds the name of the patient, and the second records the patient's reported level of swelling.
The level of swelling is relative, i.e. better, worse, or about the same.
Not all patients have the same number of observations.
I am having difficulty converting these relative values into numerical values that can be used as part of a larger analysis.
Below is pseudocode for what I think could be an appropriate solution:
for row in 'patientname'
    patientcounter = dtfr1[row, 'patientname'];
    if dtfr1[row, 'patientname'] == patientcounter
        if dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] == 'better'
            conditioncounter--;
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter;
        elseif dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] == 'worse'
            conditioncounter++;
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter;
        else
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter;
    if dtfr1[row, 'patientname'] != patientcounter
        patientcounter = dtfr1[row, 'patientname'];
What would your advice be for a good solution to this problem? Thanks!
If I'm understanding correctly, you want the difference in the counts of worse and better, by patient? If so, something like this would work.
# Simulated data
dtfr1 <- data.frame(patient = sample(letters[1:3], 100, replace=TRUE),
condition = sample(c("better", "worse"), 100, replace=TRUE))
head(dtfr1)
# patient condition
# 1 a worse
# 2 b better
# 3 b worse
# 4 a better
# 5 c worse
# 6 a better
better_count <- tapply(dtfr1$condition, dtfr1$patient, function(x) sum(x == "better"))
worse_count <- tapply(dtfr1$condition, dtfr1$patient, function(x) sum(x == "worse"))
worse_count - better_count
# a b c
# 5 0 -1
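If what you actually need is the running counter from the pseudocode (one value per row, restarting for each patient) rather than one total per patient, a minimal sketch could look like the following. The toy data, the score_map coding (-1 / 0 / +1) and the new column names step and running_score are assumptions for illustration, not something taken from your data.

```r
# Hypothetical toy data; replace with your own columns
dtfr1 <- data.frame(
  patient   = c("a", "a", "a", "b", "b", "a", "b"),
  condition = c("worse", "better", "about the same", "worse",
                "worse", "better", "better"),
  stringsAsFactors = FALSE
)

# Assumed coding: better lowers the score, worse raises it
score_map <- c("better" = -1, "about the same" = 0, "worse" = 1)

# Per-row change, looked up by name
dtfr1$step <- unname(score_map[dtfr1$condition])

# Cumulative sum within each patient, kept in the original row order
dtfr1$running_score <- ave(dtfr1$step, dtfr1$patient, FUN = cumsum)

dtfr1$running_score
# 1 0 0 1 2 -1 1
```

ave() applies cumsum() separately to each patient's rows, so patients with different numbers of observations are handled without an explicit loop.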
I would like to calculate the number of days from the time a condition is not met, to when it is met again, in a time series of daily data in R.
Toy data:
day <- data.frame(
date = seq.POSIXt(
from = ISOdatetime(2017,07,01,0,0,0),
to = ISOdatetime(2017,08,26,0,0,0),
by = "1 day" ))
var <- c(5,6,5,5,0,0,0,0,0,1,1,2,3,3,4,3,4,5,4,5,5,4,5,4,0,1,1,2,3,4,5,5,5,4,4,4,4,5,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,1,1,0,0)
ts = cbind(day, var)
The condition is var > 3.
I'd like to identify each "recovery" period as the time where var > 0 but <= 3, but only following var going to zero. Then, I'd like the number of days to recovery for each period.
So, for the example data given here, I'd expect this output:
period 1 6
period 2 5
Since var never "recovers" at the end of the dataset, I would either want it not identified as a recovery period, or given a recovery time of 0 days.
I tried this:
ifelse(ts$var >3, 0 ,(ifelse(ts$var>0 & ts$var<4, 1, 0)))
and I think I could pair this if else statement with something that only counts sequential 1s and that would mostly do it. Only problem is that it identifies the end period with the slow drop-off as a “recovery period”, and it shouldn’t. It should only identify periods following a zero as a recovery period.
Here is what the example data look like: [plot of var over time]. I think it's the minimal data I can provide that shows the realistic issues I've had with counting data outside of recovery periods.
I need to do this over a long and much more dynamic time series, so an efficient way to do this would be greatly appreciated.
edit
I don't think this will behave the way you expect if var does something like this:
[... 0, 1, 2, 1, 0, 2, 4, ...]
But it could possibly be adapted to handle this case.
original answer
I haven't tested this much, so I'd suggest checking that it works with weirder cases (e.g. var is all zeros, var starts or ends at a period boundary, or other corner cases).
# ignore zeroes if they precede another zero
s <- which(var == 0 & c(tail(var, -1), NA) != 0)
e <- which(var > 3)
sapply(s, function(x) head(e[e > x], 1) - x)
The approach here is to identify all possible start and end points of periods, then find the first end point that occurs after each start point and take the difference. A simple loop or maybe even a clever regex could be a good alternative.
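For what it's worth, the simple loop mentioned above might look something like this. It is only a sketch: the function name recovery_days and its threshold argument are made up for illustration, and it assumes a recovery is counted from the most recent zero to the first day var exceeds the threshold.

```r
var <- c(5,6,5,5,0,0,0,0,0,1,1,2,3,3,4,3,4,5,4,5,5,4,5,4,0,1,1,2,3,4,5,5,5,4,4,4,4,5,
         3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,1,1,0,0)

# Hypothetical helper: days from the last zero to the first value above threshold
recovery_days <- function(x, threshold = 3) {
  out <- integer(0)
  last_zero <- NA_integer_        # NA means "not currently recovering"
  for (i in seq_along(x)) {
    if (x[i] == 0) {
      last_zero <- i              # every zero restarts the clock
    } else if (!is.na(last_zero) && x[i] > threshold) {
      out <- c(out, i - last_zero)
      last_zero <- NA_integer_    # recovered; wait for the next zero
    }
  }
  out
}

recovery_days(var)
# 6 5

# The awkward case from the edit: a dip back to zero before recovering
recovery_days(c(var, 0, 1, 2, 1, 0, 2, 4))
# 6 5 2
```

Because the clock restarts at every zero, the trailing zeros at the end of the series never produce a period, and the slow drop-off before them is ignored since it does not follow a zero.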
Here is an alternative approach which uses the rleid() function from the data.table package to group by contiguous streaks of zero and non-zero values. It then finds the position within each group of the first occurrence of a value > 3:
library(data.table)
setDT(ts)[, if (.GRP > 1) first(which(var > 3)), rleid(var == 0)]
rleid V1
1: 3 6
2: 5 5
The first group is skipped because it is either a streak of zeros or has no preceding zero value.
This approach works even in the case Callum Webb has described in the edit of his answer:
# append data
var <- c(var, 0,1,2,1,0,2,4)
date = seq.POSIXt(
from = ISOdatetime(2017,07,01,0,0,0),
along.with = var,
by = "1 day" )
ts = data.frame(date, var)
setDT(ts)[, if (.GRP > 1) first(which(var > 3)), rleid(var == 0)]
rleid V1
1: 3 6
2: 5 5
3: 9 2
So, it has recognized that there is a recovery period of 2 days after the final zero.
For the sake of completeness, in case the sequence 0, 1, 2, 1, 0 is also considered to include a recovery period of 3 days, although it never reaches a value greater than 3:
setDT(ts)[, if (.GRP > 1) if (all(var %between% c(1, 3))) .N else first(which(var > 3)),
rleid(var == 0)]
rleid V1
1: 3 6
2: 5 5
3: 7 3
4: 9 2
Here all days between two zeros are counted if all values lie between 1 and 3.
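To make the grouping step less opaque, here is a small base R sketch that imitates what rleid(var == 0) computes and then finds the first value above 3 in each run. The short vector x and the names run_id and first_hit are assumptions for illustration; this is not how data.table implements it internally.

```r
# Small made-up series
x <- c(5, 0, 0, 1, 2, 4, 0, 3)

is_zero <- x == 0

# A new run starts wherever is_zero differs from the previous element
run_id <- cumsum(c(TRUE, is_zero[-1] != is_zero[-length(is_zero)]))
run_id
# 1 2 2 3 3 3 4 5

# Position within each run of the first value > 3 (NA if there is none)
first_hit <- tapply(x, run_id, function(v) which(v > 3)[1])

# Skip the first run (no preceding zero) and drop runs that never exceed 3
days <- as.vector(first_hit[-1])
days[!is.na(days)]
# 3
```

Runs of zeros and runs that stay at or below 3 yield NA and are dropped, which mirrors how the data.table expression returns no row for those groups.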
I am new to R and I am surprised at how long it takes to run what I believe to be rather simple lines of code, which leads me to believe I am missing something obvious. I have searched the internet and tried a few different iterations of the function, but nothing has improved the efficiency (measured in time).
The Extract data is a data frame with 18.5m rows and 11 variables. I am trying to establish two things: first, what percentage of all patients stay in hospital for 7 days or more (LOS >= 7), and second, what proportion of those 7-day stays last 21 days or more.
LOS_prob_providerage <- function(x,y){
Var1 = which(Extract$LOS>=0 & Extract$ProviderCode == x & Extract$age_group == y)
Var2 = which(Extract$LOS>=7 & Extract$ProviderCode == x & Extract$age_group == y)
return(list(Strand=(sum(Extract$LOS[Var1] >= 7)/length(Var1))*100, ELOS=(sum(Extract$LOS[Var2] >= 21)/length(Var2))*100))
}
When I call this function I give it a list of hospitals as the x variable and 1 age group from a list for the y variable (I can't seem to get it to take both as lists and output all hospitals for all age groups) using the following set of code
Providerage_prob_strand = mapply(LOS_prob_providerage,Provider_unique, agelabels[1], SIMPLIFY = FALSE)
I then create a data frame using the 2 lists that the function outputs using the code below
National = data.frame(matrix(unlist(Providerage_prob_strand), ncol=2,
byrow=T),row.names = Provider_unique)
colnames(National) <- c("Stranded_010","ELOS_010")
I subsequently re-run the last portions of code for all 11 elements in my age group list and append to the National data frame.
Question 1: Is there a less computationally intensive way to code my loop in R, or is the loop just taking that long because of the way R stores everything in memory?
Question 2: Is there any way to give R two lists for both the x and y variables using mapply/sapply, and have it output the results for both Strand and ELOS across all hospitals and age groups?
I would use the data.table package for this.
Some dummy data to demonstrate (usually it is good practice for the question asker to provide this):
set.seed(123)
df1 = data.frame(
provider = sample(LETTERS[1:4], 1000, T),
los = round(runif(1000,0,40)),
age_group = sample(1:4,1000, T))
Now we turn this into a data table
library(data.table)
setDT(df1)
and we can extract the values you want like this:
providerlist = c('A','B')
age_list = c(1,2)
df1[provider %in% providerlist & age_group %in% age_list,
.(los_greater_than7 = 100*sum(los>7)/.N),
keyby = .(provider, age_group)]
# provider age_group los_greater_than7
# 1: A 1 92.40506
# 2: A 2 81.81818
# 3: B 1 77.27273
# 4: B 2 87.50000
df1[provider %in% providerlist & age_group %in% age_list & los>7,
.(los_greater_than20 = 100*sum(los>20)/.N),
by = .(provider, age_group)]
# provider age_group los_greater_than20
# 1: A 1 56.16438
# 2: A 2 66.66667
# 3: B 1 56.86275
# 4: B 2 58.92857
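If you would rather get both measures in one pass and for every provider and age group at once, here is a base R sketch of the same idea. It uses the thresholds from the question's function (>= 7 and >= 21) rather than the > 7 and > 20 used above, and the names by_group and result are my own, so treat it as an illustration rather than a drop-in replacement.

```r
set.seed(123)
df1 <- data.frame(
  provider  = sample(LETTERS[1:4], 1000, TRUE),
  los       = round(runif(1000, 0, 40)),
  age_group = sample(1:4, 1000, TRUE)
)

# One vector of lengths of stay per provider/age-group combination
by_group <- split(df1$los, list(df1$provider, df1$age_group), drop = TRUE)

# Both percentages per group; rows are named like "A.1", "B.1", ...
result <- t(sapply(by_group, function(los) {
  c(Strand = 100 * mean(los >= 7),                  # share of all stays
    ELOS   = 100 * sum(los >= 21) / sum(los >= 7))  # share of the 7+ day stays
}))

head(result)
```

The data are split once instead of being rescanned with which() for every provider and age group, which is where most of the time goes in the original function.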
I am scraping data from a website, and in this context data tidying gets kind of hard.
What I have right now is a string of numbers that run in sequences, let's say
a<-c(1,2,3,1,2,3,4,5,1,2,3,4)
The first value that I'm looking for is 3, the second one is 5, and the third one will be 4.
So basically, I want to go through each run of the sequence 1:5 and choose its highest value, to have the final output as
a<-c(3,4,5)
I thought about choosing the maximum values, such as
a<-sort(a, decreasing = T)
a<-a[1:3]
But this won't work, because the final product is:
[1] 5 4 4
where the smaller values are dropped. Any ideas on whether this is possible?
I'm not entirely sure this is what you're asking for. I think you want to see which of your search values are present in your vector.
Try this:
a<-c(1,2,3,1,2,3,4,5,1,2,3,4)
search_values = 3:5
# unique values
search_values = a[a %in% search_values]
unique(search_values)
# counts of values
table(search_values)
sort(unlist(lapply(split(a, cumsum(c(1, diff(a)) != 1)), max), use.names = FALSE))
#[1] 3 4 5
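Unpacked step by step, that one-liner might read as follows. The intermediate names run_start, run_id and run_max are only there for illustration.

```r
a <- c(1,2,3,1,2,3,4,5,1,2,3,4)

# A new run starts wherever the value does not go up by exactly 1
run_start <- c(1, diff(a)) != 1

# Number the runs: 0 0 0 1 1 1 1 1 2 2 2 2
run_id <- cumsum(run_start)

# Highest value of each run, in order of appearance: 3 5 4
run_max <- vapply(split(a, run_id), max, numeric(1))

sort(unname(run_max))
# 3 4 5
```

Unlike taking the three largest values overall, this keeps one maximum per run, so the 3 from the first run survives even though later runs contain larger numbers.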
Sounds like you want something like this?
a <- c(1,2,3,1,2,3,4,5,1,2,3,4) # Data input
a <- unique(a) # Keep unique numbers
a <- sort(a, decreasing = FALSE) # Sort ascending
tail(a, 3) # Last three numbers in set
Gives:
[1] 3 4 5
In one line:
tail(sort(unique(a), decreasing = FALSE), 3)
I want to store values in "yy", but my code below stores only one row (the last value). Please see the output below. Can somebody help me store all the values in "yy"?
Thanks in advance. I am a beginner to R.
arrPol <- as.matrix(unique(TN_97_Lau_Cot[,6]))
arrYear <- as.matrix(unique(TN_97_Lau_Cot[,1]))
for (ij in length(arrPol)){
for (ik in length(arrYear)) {
newPolicy <- subset(TN_97_Lau_Cot, POLICY == as.character(arrPol[ij]) & as.numeric(arrYear[ik]))
yy <- newPolicy[which.min(newPolicy$min_dist),]
}
}
Output:
YEAR DIVISION STATE COUNTY CROP POLICY STATE_ABB LRPP min_dist
1: 2016 8 41 97 21 699609 TN 0 2.6
Here is an image of the "TN_97_Lau_Cot" matrix.
No loops required. There could be an easier way to do it, but two set-based steps are better than two loops. These are the two ways I would try and do it:
base
# Perform an aggregate and merge it to your data.frame.
TN_97_Lau_Cot_Agg <- merge(
x = TN_97_Lau_Cot,
y = aggregate(min_dist ~ YEAR + POLICY, data = TN_97_Lau_Cot, min),
by = c("YEAR","POLICY"),
all.x = TRUE
)
# Subset the values that you want.
TN_97_Lau_Cot_Final <- unique(subset(TN_97_Lau_Cot_Agg, min_dist.x == min_dist.y))
data.table
library(data.table)
# Convert your data.frame to a data.table.
TN_97_Lau_Cot <- data.table(TN_97_Lau_Cot)
# Perform a "window" function that calculates the min value for each year and policy without reducing the rows.
TN_97_Lau_Cot[, minDistAggregate:=min(min_dist), by = c("YEAR","POLICY")]
# Find the policy numbers that match the minimum distance for that year.
TN_97_Lau_Cot_Final <- unique(TN_97_Lau_Cot[min_dist==minDistAggregate, -10, with=FALSE])
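In case it helps to see why the original loop kept only one row: for (ij in length(arrPol)) visits just the last index, the year condition never compares against YEAR, and yy is overwritten on every pass. Below is a sketch of a repaired loop, purely for comparison with the set-based versions above. The small data frame is a made-up stand-in for TN_97_Lau_Cot, and results is a name I introduced.

```r
# Hypothetical stand-in for TN_97_Lau_Cot (only the columns used here)
TN_97_Lau_Cot <- data.frame(
  YEAR     = c(2015, 2015, 2016, 2016, 2016),
  POLICY   = c("699609", "699609", "699609", "700001", "700001"),
  min_dist = c(3.1, 2.9, 2.6, 4.0, 1.5),
  stringsAsFactors = FALSE
)

arrPol  <- unique(TN_97_Lau_Cot$POLICY)
arrYear <- unique(TN_97_Lau_Cot$YEAR)

results <- list()                      # collect one row per policy/year
for (pol in arrPol) {                  # iterate over the values themselves
  for (yr in arrYear) {
    newPolicy <- TN_97_Lau_Cot[TN_97_Lau_Cot$POLICY == pol &
                               TN_97_Lau_Cot$YEAR == yr, ]
    if (nrow(newPolicy) > 0) {         # skip combinations with no data
      results[[length(results) + 1]] <- newPolicy[which.min(newPolicy$min_dist), ]
    }
  }
}
yy <- do.call(rbind, results)          # bind the collected rows together
yy$min_dist
# 2.9 2.6 1.5
```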
I have a sequence of data frame subsetting operations. Some of them might fail because the rows to be replaced do not exist. I would still like the others to execute. Example:
source_data[source_data$abbr_d == "bdp",]$party_id <- 32
source_data[source_data$abbr_d == "svp",]$party_id <- 4
source_data[source_data$abbr_d == "cvp",]$party_id <- 2
source_data[source_data$abbr_d == "fdp",]$party_id <- 1
source_data[source_data$abbr_d == "gps",]$party_id <- 13
source_data[source_data$abbr_d == "sp",]$party_id <- 3
source_data[source_data$abbr_d == "csp",]$party_id <- 8
source_data[source_data$abbr_d == "pcs",]$party_id <- 8
Error in `$<-.data.frame`(`*tmp*`, "party_id", value = 13) :
replacement has 1 row, data has 0
source_data[source_data$abbr_d == "lega",]$party_id <- 18
source_data[source_data$abbr_d == "edu",]$party_id <- 16
source_data[source_data$abbr_d == "glp",]$party_id <- 31
I would like the script to continue after the error has been thrown. I've tried using tryCatch() but that doesn't really help because I don't know in advance at which point the replacement will fail.
Is there a way to tell R to just "not care" about those replacement errors? And still continue with the next replacement operations?
The only solution I came up with is to use if-statements like this, which is tedious:
if(nrow(source_data[source_data$abbr_d == "lega", 1]) > 0){
source_data[source_data$abbr_d == "lega",]$party_id <- 18
}
if(nrow(source_data[source_data$abbr_d == "edu", 1]) > 0){
source_data[source_data$abbr_d == "edu",]$party_id <- 16
}
etc...
That is quite verbose code. Luckily, there is a way to do this in a fraction of the code that also avoids your issue. My suggestion is to use a lookup table to build the party_id column:
df = data.frame(abbr_d = sample(LETTERS[1:8], 100, replace = TRUE))
lookup_table = 1:8
names(lookup_table) = LETTERS[1:8]
# A B C D E F G H
# 1 2 3 4 5 6 7 8
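# as.character() guards against abbr_d being a factor, which would index by integer code
df$party_id = lookup_table[as.character(df$abbr_d)]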
So, you create the link between abbr_d and party_id once (here letters and simple numbers, but simply replace your values), and use the df$abbr_d column to subset the lookup table. This maps the labels in abbr_d to the values that correspond to that for party_id.
The error you see is avoided because only abbr_d values that are actually in the data are looked up in the lookup table. Unneeded entries in the lookup table do not pose an issue.
A dplyr approach as a bonus:
library(dplyr)
df %>% mutate(party_id = lookup_table[as.character(abbr_d)])
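Applied to the party codes from the question, the lookup-table idea might look roughly like this. The five-row source_data and the unmatched value "xyz" are invented for the example, and party_lookup is a name I made up.

```r
# Hypothetical sample of the real data; "xyz" stands for an unknown party
source_data <- data.frame(
  abbr_d = c("svp", "sp", "fdp", "xyz", "svp"),
  stringsAsFactors = FALSE
)

# Abbreviation -> id pairs copied from the question
party_lookup <- c(bdp = 32, svp = 4, cvp = 2, fdp = 1, gps = 13, sp = 3,
                  csp = 8, pcs = 8, lega = 18, edu = 16, glp = 31)

# Look up by name; abbreviations missing from the table become NA
source_data$party_id <- unname(party_lookup[as.character(source_data$abbr_d)])

source_data$party_id
# 4 3 1 NA 4
```

Parties such as "gps" that never occur in the data simply go unused, and abbreviations without an entry end up as NA instead of raising an error.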
You can use the data.table library to mitigate the issue:
txt<-"
1,a,1
2,b,2
3,c,3
4,d,4
"
dat = read.delim(textConnection(txt),
header=FALSE,sep=",",strip.white=TRUE)
dat
dat[dat$V2=="e",]$V3<-4
# Error in `$<-.data.frame`(`*tmp*`, "V3", value = 4) :
# replacement has 1 row, data has 0
library(data.table)
data=as.data.table(dat)
data[data$V2=="e",]$V3<-4
# no error thrown
data.table is often faster than data frame, afaik.
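Staying with plain data frames, another option that I believe sidesteps the error is to index the column vector instead of subsetting the data frame first. This is a sketch on the same toy data, and it assumes the target column already exists.

```r
dat <- data.frame(
  V1 = 1:4,
  V2 = c("a", "b", "c", "d"),
  V3 = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# No row matches "e": the index is all FALSE, so nothing is assigned and no error is raised
dat$V3[dat$V2 == "e"] <- 4

# A matching value is replaced as usual
dat$V3[dat$V2 == "b"] <- 20

dat$V3
# 1 20 3 4
```

Assigning into an empty selection of a vector is a no-op, whereas dat[cond, ]$V3 <- value first builds a zero-row data frame and then fails when trying to fill it.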