PSM in R with specific lines

To get matched pairs via PSM (the MatchIt package with method = "full"), I need to specify the command for my longitudinal data frame. Every case has several observations, but only the first observation per patient should be included in the matching. So the matching should be based on every patient's first observation, while my later analysis should include the complete dataset of each patient with all observations.
Does anyone have an idea how to achieve this?
I tried using a data subset (first observation per patient) but wasn't able to get the matching merged back into the full dataset (with all observations per patient) using match.data().
Thanks in advance
Simon (desperately writing his master's thesis)

My understanding is that you want to create matches at just the first time point but have those matches identified for each unit at all time points. Fortunately, this is pretty straightforward: just perform the matching at the first time point and then merge the matched dataset with the full dataset. Here is how this might look. Let's say your original long dataset is d and has an ID column id and a time column time.
library(MatchIt)
# Match on each patient's first observation only
m <- matchit(treat ~ X1 + X2, data = subset(d, time == 1), method = "full")
md1 <- match.data(m)
# Carry the matching subclass and weights back to every observation
d <- merge(d, md1[c("id", "subclass", "weights")], by = "id", all.x = TRUE)
Your new dataset should have two new columns, subclass and weights, which contain the matching subclass and matching weight for each unit. Rows with identical IDs (i.e., rows corresponding to the same unit at multiple time points) will have the same values of subclass and weights.
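For illustration, a hedged sketch of how the merged columns might then feed the later analysis; the outcome column Y and the simple model form are assumptions, not part of the original answer:
# Illustrative only: Y is an assumed outcome column. With full matching you
# would typically also account for subclass (e.g., cluster-robust SEs).
fit <- lm(Y ~ treat, data = subset(d, time == 2), weights = weights)
summary(fit)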

Related

Creating horizontal dataframe from vertical table (with repeated variables)

Need to create usable dataframe using R or Excel
Variable1           ID      Variable2
Name of A person 1  002157  NULL
Drugs used          NULL    3.0
Days in hospital    NULL    2
Name of a surgeon   NULL    JOHN T.
Name of A person 2  002158  NULL
Drugs used          NULL    4.0
Days in hospital    NULL    5
Name of a surgeon   NULL    ADAM S.
I have a table exported from 1C (accounting software). It contains more than 20 thousand observations. The task is to analyze how many drugs were used and how many days each patient stayed in the hospital.
For that reason, I need to transform this dataframe into a second dataframe that is suitable for analysis (from the vertical layout above to a horizontal one). Basically, I have to create a dataframe consisting of 4 columns: ID, Drugs used, Days in hospital, and Name of a surgeon. I am guessing that it requires two functions:
for ID, it must read the first dataframe and extract the filled rows
for Name of a surgeon, Drugs used, and Days in hospital, the function has to check that the row corresponds to one of those variables and extract the data from the third column, adding it to the second dataframe.
In short, I have no idea how to do that. Could you help me write functions for R, or give tips for Excel?
For R, I guess you want something like this:
1) Load the table; make sure to substitute the "," with the separator that is used in your file (could be ";" or "\t" for tab, etc.). Passing na.strings = "NULL" tells R to read the NULL entries as NA, which the later steps rely on:
df = read.table("path/to/file", sep = ",", header = TRUE, na.strings = "NULL")
2) Create subset tables that contain only one row per patient:
id = subset(df, !is.na(ID))  # is.null() would not work row-wise; the NULLs are read as NA
drugs = subset(df, Variable1 %in% "Drugs used")
days = subset(df, Variable1 %in% "Days in hospital")
#...etc...
3) Make a new data frame that contains this information:
new_df = data.frame(
  id = id$ID,
  drugs = drugs$Variable2,
  days = days$Variable2
  #...etc...no comma after the last!
)
EDIT:
Note that this approach only works if your table is basically perfect! Otherwise there might be shifts in the data.
EDIT 2:
If you have an imperfect table, you might want to do something like this:
Step 1.5) Change all NA values in the ID column (which in your table are labeled NULL; see the na.strings argument in step 1) to the patient ID of the row above, filling the IDs downward. Note that the is.na() function in the code below is specifically for that, and will not work with NULL or "NULL" or other stuff:
for (i in seq_along(df$ID)) {
  if (is.na(df$ID[i])) df$ID[i] <- df$ID[i - 1]
}
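As a hedged aside, not in the original answer: tidyr's fill() does the same fill-down without a loop.
library(tidyr)
df <- fill(df, ID, .direction = "down")  # carry the last non-NA ID downward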
Then go again to step 2) above (you don't need the id subset, though), and then you have to change each data frame a little. As an example, for the drugs and days data frames:
drugs = drugs[, -1] #removes the first column
colnames(drugs) = c("ID","drugs") #renames the columns
days = days[, -1]
colnames(days) = c("ID", "days")
Then instead of doing step 3 as above, use merge and choose the ID column to be the merging column. Passing all = TRUE keeps patients that are missing one of the rows:
new_df = merge(drugs, days, by = "ID", all = TRUE)
Repeat this for the other subsetted data frames:
new_df = merge(new_df, surgeon, by = "ID", all = TRUE)
# etc...
That is much more robust: even if some patients lack a line that others have (e.g., days), their respective column in this new data frame will just contain an NA for that patient.
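For completeness, a hedged alternative not in the original answer: after the ID fill-down in step 1.5, the table is an ordinary long format, so tidyr can reshape it in one call. Column names follow the example table; values come out as character and may need conversion.
library(tidyr)
# Keep only the rows that carry a value, then spread Variable1 into columns
new_df <- pivot_wider(subset(df, !is.na(Variable2)),
                      id_cols = ID,
                      names_from = Variable1,
                      values_from = Variable2)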

R Using lag() to create new columns in dataframe

I have a dataset with survey score results for 3 hospitals over a number of years. This survey contains 2 questions.
The dataset looks like this -
set.seed(1234)
library(dplyr)
library(tidyr)
dataset = data.frame(
  Hospital = c(rep('A', 10), rep('B', 8), rep('C', 6)),
  YearN = c(2015, 2016, 2017, 2018, 2019,
            2015, 2016, 2017, 2018, 2019,
            2015, 2016, 2017, 2018,
            2015, 2016, 2017, 2018,
            2015, 2016, 2017,
            2015, 2016, 2017),
  Question = c(rep('Overall Satisfaction', 5),
               rep('Overall Cleanliness', 5),
               rep('Overall Satisfaction', 4),
               rep('Overall Cleanliness', 4),
               rep('Overall Satisfaction', 3),
               rep('Overall Cleanliness', 3)),
  ScoreYearN = runif(24, min = 0.6, max = 1),
  TotalYearN = round(runif(24, min = 1000, max = 5000), 0)
)
MY OBJECTIVE
To add two columns to the dataset such that:
The first column contains the score for the given question in the given hospital for the previous year.
The second column contains the total number of respondents for the given question in the given hospital for the previous year.
MY ATTEMPT
I called the first column ScoreYearN-1 and the second column TotalYearN-1
I used the lag function to create the new columns that contain the lagged values from the existing columns.
library(dplyr)
library(tidyr)
dataset$`ScoreYearN-1`=lag(dataset$ScoreYearN)
dataset$`TotalYearN-1`=lag(dataset$TotalYearN)
Which gives me a resulting dataset where I have the desired outcome for the first five rows only (these rows correspond to the first Hospital-Question combination).
The remaining rows do not account for this grouping, and hence the 2015 'N-1' values take on the values of the previous group.
I'm not sure this is the best way to go about this problem. If you have any better suggestions, I'm happy to consider them.
Any help will be greatly appreciated.
You're close! Just use dplyr to group by hospital and question:
dataset_lagged <- dataset %>%
  group_by(Hospital, Question) %>%
  mutate(`ScoreYearN-1` = lag(ScoreYearN),
         `TotalYearN-1` = lag(TotalYearN))
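One caveat worth adding, assuming the rows might not arrive sorted (an addition, not part of the original answer): lag() is purely positional, so arranging by year within each group first guarantees that the previous row really is the previous year.
dataset_lagged <- dataset %>%
  arrange(Hospital, Question, YearN) %>%   # make row order match year order
  group_by(Hospital, Question) %>%
  mutate(`ScoreYearN-1` = lag(ScoreYearN),
         `TotalYearN-1` = lag(TotalYearN))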

Match coordinates in dataset with nearest neighbor more efficiently - and retain names?

I'm trying to match businesses with the nearest business within the same dataset. The dataset consists of the business name, coordinates, and star rating (out of 5). There is a row for every business, per quarter, per year (about 5 years). While I've figured this out for simply the nearest business, I need to do this ~100 more times based on different criteria. (Splitting into different quarters for each year, matching neighbors with same ratings, etc).
Right now all I can think of doing is splitting the dataset to include only the businesses I need based on the criteria and then matching like that, but doing it 100 times and then rejoining the data sounds... like a terrible idea. I'm not too experienced with spatial stuff in R, so any ideas or help would be amazing.
Similarly, the existing code I have (below) provides me with the ID # of the nearest neighbor, but not the name. Is there a way to match neighbors while retaining columns other than just the ID #?
The code I already have is below. I'm using:
library(RANN)
library(sp)  # for coordinates()
library(sf)  # for st_as_sf() and st_distance()
sp.data <- data
coordinates(sp.data) <- ~lon+lat
sf.data <- st_as_sf(sp.data)
dist_m <- st_distance(sf.data, sf.data)  # full pairwise distance matrix
index <- apply(dist_m, 1, order)         # for each point, neighbours sorted by distance
index <- index[2:nrow(index), ]          # drop each point itself
index <- index[1, ]                      # keep only the nearest neighbour's row number
index2 <- as.data.frame(index)
index2 <- data.frame(t(index2))
This then provides me with the ID# of the nearest neighbor. As you can see, running this over and over again with different data (split from the original data based on rating criteria and quarter, etc) isn't very efficient. Any help is appreciated. Thank you!
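No answer is recorded for this one, so here is a hedged sketch of one way it might be done with RANN (which the question already loads): nn2() finds nearest neighbours in a single call, and the returned row indices can pull any column, including the name. The column names (name, lon, lat) and grouping variables (quarter, rating) are assumptions for illustration, and nn2() uses plain Euclidean distance, so projected coordinates rather than raw lon/lat would be needed for true distances.
library(RANN)
library(dplyr)
nearest_with_name <- function(df) {
  # k = 2 because the nearest point to each business is itself;
  # each group must therefore contain at least two rows
  nn <- nn2(df[, c("lon", "lat")], k = 2)
  idx <- nn$nn.idx[, 2]            # row number of the true nearest neighbour
  df$nearest_name <- df$name[idx]  # any other column can be carried over the same way
  df$nearest_dist <- nn$nn.dists[, 2]
  df
}
# Run once per criteria group instead of splitting and rejoining by hand:
result <- data %>%
  group_by(quarter, rating) %>%
  group_modify(~ nearest_with_name(.x)) %>%
  ungroup()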

Normalize data by use of ratios based on a changing dataset in R

I am trying to normalize a Y scale by converting all values to percentages.
Therefore, I need to divide every number in a column by the first number in that column. In Excel, this would be equivalent to locking a cell A1/$A1, B1/$A1, C1/$A1 then D1/$D1, E1/$D1...
The data first needs to meet four criteria (Time, Treatment, Concentration and Type), and the reference value changes at every new treatment. Each treatment has 4 concentrations (0, 0.1, 2 and 50). I would like the values associated with each concentration to be divided by the reference value (where the concentration is equal to 0).
The tricky part is that this reference value changes every 4 rows.
I have tried doing this using ddply:
MasterTable <- read.csv("~/Dropbox/Master-table.csv")
MasterTable <- ddply(MasterTable, .(Time, Type, Treatment), transform, pc=(Value/Value$Concentration==0))
But this is not working at all. Any help would be really appreciated!
My data file can be found here:
Master-table
Thank you!
dplyr is very efficient here:
library(dplyr)
result <- group_by(MasterTable, Time, Type, Treatment) %>%
  mutate(pc = Value / Value[1])
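One hedged refinement, assuming the Concentration == 0 row is not guaranteed to come first within each group: index by the condition instead of by position.
result <- group_by(MasterTable, Time, Type, Treatment) %>%
  mutate(pc = Value / Value[Concentration == 0][1])  # divide by the group's reference value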

Add column from another data.frame based on multiple criteria

I have 2 data frames:
cars = data.frame(car_id = c(1, 2, 2, 3, 4, 5, 5),
                  max_speed = c(150, 180, 185, 200, 210, 230, 235),
                  since = c('2000-01-01', '2000-01-01', '2007-10-01', '2000-01-01',
                            '2000-01-01', '2000-01-01', '2009-11-18'))
voyages = data.frame(voy_id = c(1234, 1235, 1236, 1237, 1238),
                     car_id = c(1, 2, 3, 4, 5),
                     date = c('2000-01-01', '2002-02-02', '2003-03-03', '2004-04-04', '2010-05-05'))
If you look closely you can see that cars occasionally has multiple entries for a car_id, because the manufacturer decided to increase the max speed of that make. Each entry has a date, marked by since, indicating the date from which that max speed applies.
My goal: I want to add the max_speed variable to the voyages data frame based on the values found in cars. I can't just join the 2 data frames by car_id, because I also have to check the date in voyages and compare it to since in cars to determine the proper max_speed.
Question: What is the elegant way to do this without loops?
One approach:
Merge the two datasets, including duplicated observations in "cars".
Drop any observations where the date for "since" is later than the date for "date". Order the dataset so most recent dates are first, then drop duplicated observations for "voy_id"--this ensures that where there are two dates in "since", you'll only keep the most recent one that occurs before the voyage date.
z <- merge(cars, voyages, by = "car_id")              # all car/voyage combinations
z <- z[as.Date(z$since) <= as.Date(z$date), ]         # keep speeds already in effect
z <- z[order(as.Date(z$since), decreasing = TRUE), ]  # most recent 'since' first
z <- z[!duplicated(z$voy_id), ]                       # one row per voyage: the latest applicable
Also curious to see if someone comes up with a more elegant, parsimonious approach.
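Since the answer invites alternatives, here is a hedged dplyr version of the same logic (keep the latest since at or before each voyage date):
library(dplyr)
z <- voyages %>%
  inner_join(cars, by = "car_id") %>%          # all car/voyage combinations
  filter(as.Date(since) <= as.Date(date)) %>%  # speeds already in effect
  group_by(voy_id) %>%
  slice_max(as.Date(since), n = 1, with_ties = FALSE) %>%
  ungroup()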
