I have a dataset of 19,000 rows. The number of unique patient IDs is 15,000.
I want a subset with one row per unique ID, keeping the other variables as in the original dataset.
patnr age and 25 other variables
1 20
2 21
3 16
4 5
19000
How can I do this? Right now I can only see how many unique patient IDs are in this dataset, using this command:
length(unique(data$patnr))
Let's say your data.frame is called df. You can use duplicated as follows to select the first row in which each patient ID appears:
dfUnique <- df[!duplicated(df$patnr), ]
Note that this will drop roughly 4,000 rows and you would lose that information if the other variables are different for the same patient in the second observation.
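As a minimal sketch of what "first instance per patient" means, here is the idea on a made-up five-row data frame (the values are invented for illustration; only the column names patnr and age come from the question). It uses duplicated(), which flags every repeat of a value that has already been seen:

```r
# Toy data: patients 1 and 3 each appear twice (hypothetical values)
df <- data.frame(patnr = c(1, 2, 1, 3, 3),
                 age   = c(20, 21, 20, 16, 17))

# duplicated() is TRUE for every repeat of an earlier value,
# so negating it keeps the first row seen for each patient
dfUnique <- df[!duplicated(df$patnr), ]

nrow(dfUnique)            # number of rows kept
length(unique(df$patnr))  # same number: one row per patient
```

Here the second row for patient 3 (age 17) is dropped, which is exactly the kind of information loss mentioned above.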
Related
I have a data frame (df) with the columns (ID), (Adm_Date), (ICD_10) and (points). It has 1,000,000 rows.
(points) is the value assigned to each (ICD_10) code.
(ID): each ID has many rows.
(Adm_Date) ranges from 2010-01-01 to 2018-01-01.
For each (ID) and each (Adm_Date), I want the sum of (points), counting each (ICD_10) only once, over the rows that fall within the 2 years before that (Adm_Date).
The periods like these:
01-01-2010 to 31-01-2012,
01-02-2010 to 29-02-2012,
01-03-2010 to 31-03-2012, and so on up to the last period, 01-12-2016 to 31-12-2018.
My problem is with the date filter. It does not filter the rows by period. It sums (points) for each (ID) without duplicates over all the data from 2010 to 2018, instead of summing them per period for each (ID).
I used this code:
start.date= seq(as.Date (df$Adm_Date))
end.date = seq(as.Date (df$Adm_Date+ years(-2)))
Sum_df<- df %>% dplyr::filter(Adm_Date >=start.date & Adm_Date<=end.date) %>%
group_by(ID) %>%
mutate(sum_points = sum(points*!duplicated(ICD_10)))
but the filter did not work: it sums (points) for each (ID) over all dates from 2010 to 2018 instead of summing them per period for each (ID).
sum_points will start from 01-01-2012: I need the sum for every Adm_Date >= 01-01-2012.
Take the patient with ID = 11. I sum points from row 3 to row 23, ignoring repeated ICD_10 codes (e.g. G81 and I69 are repeated in this period), so the result looks like this:
ID(11), Adm_Date(07-05-2012), sum_points(17). For the same patient at Adm_Date(13-06-2013), I sum from row 11 to row 27, because the window looks back 2 years from Adm_Date. So:
ID(11), Adm_Date(13-06-2013), sum_points(14.9)
I have about half a million IDs and more than a million rows.
I hope I explained it well. Thank you
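One possible reading of the look-back described above can be sketched in base R as follows. This is only an illustration on invented toy data, not a tested solution for the real table: the helper name window_sum is hypothetical, the window is assumed to be closed on both ends, and a repeated ICD_10 code is assumed to count once (its first occurrence in the window).

```r
# Hypothetical toy data in the shape described above
df <- data.frame(
  ID       = c(11, 11, 11, 11, 12),
  Adm_Date = as.Date(c("2010-03-01", "2011-06-15", "2012-05-07",
                       "2013-06-13", "2012-01-10")),
  ICD_10   = c("G81", "G81", "I69", "G81", "I10"),
  points   = c(2, 2, 3, 2, 4)
)

# For row i: keep rows of the same ID whose Adm_Date lies in
# [Adm_Date[i] - 2 years, Adm_Date[i]], drop repeated ICD_10 codes,
# then sum the remaining points
window_sum <- function(i, d) {
  start  <- seq(d$Adm_Date[i], length.out = 2, by = "-2 years")[2]
  in_win <- d$ID == d$ID[i] &
            d$Adm_Date >= start &
            d$Adm_Date <= d$Adm_Date[i]
  w <- d[in_win, ]
  sum(w$points[!duplicated(w$ICD_10)])
}

df$sum_points <- vapply(seq_len(nrow(df)), window_sum, numeric(1), d = df)
df
```

Each row is compared against the whole table, so this version is meant to show the logic only; for a million rows it would probably need to be restructured, for example by splitting the data by ID first.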
I have a list of 20 data frames. Each data frame corresponds to one patient, and all of them have the same columns:
ID, City, Counts, Day
The ID column is the same in all of them; the other columns differ.
What I want is a new data frame with an ID column and one column of Counts values per patient.
Something like:
ID Patient1 Patient2 ... Patient20
1
2
3
4
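A minimal sketch of one way to build such a table, assuming (as stated above) that the ID column is identical and in the same row order in every data frame. The list name patients and all values are made up, and three data frames stand in for the 20:

```r
# Hypothetical list of 3 (instead of 20) patient data frames
patients <- list(
  data.frame(ID = 1:4, City = "A", Counts = c(5, 3, 0, 2), Day = 1),
  data.frame(ID = 1:4, City = "B", Counts = c(1, 1, 4, 0), Day = 2),
  data.frame(ID = 1:4, City = "C", Counts = c(7, 2, 2, 9), Day = 3)
)

# Pull Counts out of each data frame; sapply() binds them as matrix columns
counts <- sapply(patients, function(p) p$Counts)
colnames(counts) <- paste0("Patient", seq_along(patients))

# ID is the same everywhere, so take it from the first data frame
wide <- data.frame(ID = patients[[1]]$ID, counts)
wide
```

If the rows were not in the same ID order in every data frame, merging on ID (for example with merge()) would be the safer route.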
Complete R novice here.
I have a wide-form data frame that includes a variable participant_number. Each participant provides two responses (score) under a within-subjects manipulation (code).
I also have three separate sets of values containing the participant numbers of the three different (between-subjects) experimental groups (e.g. control, active_1, active_2).
How can I use these sets of values to create a variable in my main data frame which indicates what experimental group the participant belongs to?
Any help, much appreciated.
The package "dplyr" is quite useful for this kind of thing. Let's consider a small working example:
df <- data.frame(ID=c(1:7))
ListActive1 <- c(1,3)
ListActive2 <- c(2,5)
ListControl <- c(4,7,6)
df is the main data frame containing the ID of the participant (and of course it may have further columns, e.g. the score etc.) The three vectors contain for each group the IDs of the participants belonging to this particular group, e.g. the participants with ID 2 and 5 belong to the group "Active2".
Now we create a new column in the main data frame using the command mutate which comes with the dplyr package (make sure to install and load it).
df <- mutate(df,group=case_when(
ID %in% ListActive1 ~ "Active1",
ID %in% ListActive2 ~ "Active2",
ID %in% ListControl ~ "Control"))
The command case_when checks for each participant in which of the lists the ID appears and then puts the corresponding label in the new column group.
ID group
1 1 Active1
2 2 Active2
3 3 Active1
4 4 Control
5 5 Active2
6 6 Control
7 7 Control
I have the two data frames below. Scale is the allowable score range, in this case 1-5. Score holds the actual scores given by participants, using the values defined in Scale. I need to count the number of scores for each scale point in the Score data frame. For example, there are two counts of 2 in Score and zero counts of 1.
a<-c(1,2,3,4,5)
b<- c(2,3,4,5,3,4,4,3,3,5,2,3,3)
Scale<-data.frame(Scale =a)
Score<-data.frame(Score=b)
I tried aggregate, but it only identifies the unique values it finds in one data frame and cannot consult the other. For example, it cannot tell that there are zero scores of 1 in Score, and only returns counts for 2, 3, 4 and 5.
Does anyone have a good idea?
Like this?
> table(factor(Score$Score, levels = a))
1 2 3 4 5
0 2 6 3 2
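The zero for scale point 1 appears because every scale point is declared as a factor level, and table() reports all levels whether or not they occur. As a small sketch building on the data from the question, the counts can also be attached to the Scale data frame (the column name Count is an assumption made for illustration):

```r
a <- c(1, 2, 3, 4, 5)
b <- c(2, 3, 4, 5, 3, 4, 4, 3, 3, 5, 2, 3, 3)
Scale <- data.frame(Scale = a)
Score <- data.frame(Score = b)

# All scale points become factor levels, so unused ones are counted as 0
counts <- table(factor(Score$Score, levels = Scale$Scale))

# Store the counts next to the scale points they belong to
Scale$Count <- as.vector(counts)
Scale
```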
I have a data frame with 21 variables and 1200 observations. The first column is the ID name for each species and column 21 is the total count of all the times each species was seen across multiple sites.
example columns: ID, RM1, RM2, RM10, Total
each row is an ID name and counts per river mile and total count
All I want is a list of the top 20 (or 100 for that matter) most abundant species and their total count. How do I do this?
This is driving me crazy and I don't want to do it in Excel - there must be a way in R.
Sort your data frame, let's call it df, by Total, and take the top 100:
head(df[order(df$Total,decreasing = TRUE), ], 100)
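If only the species name and its total are wanted, the same idea can be restricted to those two columns. A minimal sketch on a made-up four-species table (the species names and counts are invented; the column names follow the example columns in the question):

```r
# Hypothetical miniature of the species table
df <- data.frame(ID    = c("sp_a", "sp_b", "sp_c", "sp_d"),
                 RM1   = c(1, 4, 0, 2),
                 RM2   = c(3, 6, 1, 3),
                 Total = c(4, 10, 1, 5))

# Sort by Total (largest first), keep only ID and Total, take the top 2
top <- head(df[order(df$Total, decreasing = TRUE), c("ID", "Total")], 2)
top
```

Replacing 2 with 20 or 100 gives the lists asked for on the real data.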