Here is the data I have:
ID VALUE
1 1
2 -1
3 1
4 1
5 1
1 1
2 1
3 -1
4 1
5 -1
...
How can I get a table like this:
ID value=1 value=-1
1 2 0
2 1 1
3 1 1
4 2 0
5 1 1
Here value=1 is the number of times 1 appears in the VALUE column for each ID, and value=-1 is the same count for -1.
data work.indata ;
input ID VALUE;
cards;
1 1
2 -1
3 1
4 1
5 1
1 1
2 1
3 -1
4 1
5 -1
;
run;
/*
proc sort data=work.indata;
by ID;
run;
*/
proc freq noprint data=work.indata;
tables ID * VALUE /out=WORK.COUNTS nopercent ;
run;
proc transpose data=WORK.COUNTS out=work.output (drop=_name_ _label_);
id value;
by id;
var count;
run;
I would like to find a child's sibling(s) in survey data, check if it has ANY sibling whose age is <= 1 year, and store the result (1,0).
Here is my data:
cluster house_number age
1 5 0
1 5 1
1 8 4
1 21 4
1 21 1
2 22 0
2 36 0
2 5 0
2 5 2
2 5 3
I thought of matching rows on cluster and house_number and then checking the age. But once there is a match, how can I check the ages of each child's siblings and store the result (1 when the child has at least one sibling <= 1 year of age)? The result should look like this:
cluster house_number age sibling_age1
1 5 0 1
1 5 1 1
1 8 4 0
1 21 4 1
1 21 1 0
2 22 0 0
2 36 0 0
2 5 0 0
2 5 2 1
2 5 3 1
Do you mean something like this:
# Let's call your data frame: data
# ifelse(test, yes, no) returns `yes` where the test is TRUE and `no` otherwise,
# so this new column is 1 where the row's own age is greater than 1 and 0 elsewhere
data$sibling_age = ifelse(data$age>1,1,0)
# I hope that this is what you were looking for.
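Note that the line above flags each row by its own age. If the goal is the table in the question, where a child is flagged when at least one other child in the same cluster and house_number is aged <= 1, a minimal base R sketch could look like the following. It assumes the data frame is called data and uses the question's column names; young_in_house is a helper name made up for illustration.

```r
# Example data copied from the question
data <- data.frame(
  cluster      = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  house_number = c(5, 5, 8, 21, 21, 22, 36, 5, 5, 5),
  age          = c(0, 1, 4, 4, 1, 0, 0, 0, 2, 3)
)

# Is this child aged <= 1?
is_young <- data$age <= 1

# Number of children aged <= 1 in the same cluster/house (hypothetical helper name)
young_in_house <- ave(as.integer(is_young), data$cluster, data$house_number, FUN = sum)

# Subtract the child itself so that only siblings are counted
data$sibling_age1 <- as.integer(young_in_house - is_young > 0)
data
```

On the example data this should reproduce the sibling_age1 column shown in the question.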
I got a dataframe like this:
id day time state
<chr> <dbl> <dbl> <dbl>
1 A 1 1 0
2 A 1 2 0
3 A 1 3 1
4 A 2 1 0
5 A 2 2 1
6 A 2 3 1
7 A 3 1 1
8 A 3 2 1
9 A 3 3 0
In the original dataframe, there are 30 ids and every id has 5 days with 1440 timepoints (so 216000 rows in total).
Now I want to create a new variable called "delta" that records whether the state (1 or 0) is equal between two different timepoints (1 = equal, 0 = unequal).
For example:
Compare if state of day 1 time 1 is = state of day 2 time 1, day 1 time 2 = day 2 time 2....
and then day 2 time 1 = day 3 time 1 and so on.
In the end it should look like this:
id day time state delta
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 1 0 1
2 A 1 2 0 0
3 A 1 3 1 1
4 A 2 1 0 0
5 A 2 2 1 1
6 A 2 3 1 0
7 A 3 1 1 1
8 A 3 2 1 0
9 A 3 3 0 1
I already tried some code using ifelse(), but I couldn't work it out yet.
Assuming I've understood correctly: for the day i, time j row, delta should indicate whether the state on day i + 1 at time j (next day, same time) is equal.
Here's a dplyr method:
library(dplyr)
your_data %>%
  group_by(id, time) %>%   # group by id as well, since the full data has 30 ids
  arrange(day) %>%
  mutate(delta = as.integer(lead(state) == state))   # NA on the last day, which has no next day
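For readers without dplyr, the same next-day comparison can be sketched in base R. This is only an illustration under the same interpretation of the question: it uses the nine example rows, and the data frame name df is an assumption.

```r
# Example rows from the question
df <- data.frame(
  id    = "A",
  day   = rep(1:3, each = 3),
  time  = rep(1:3, times = 3),
  state = c(0, 0, 1, 0, 1, 1, 1, 1, 0)
)

# Sort so that, within each id/time group, rows run from the first day to the last
df <- df[order(df$id, df$time, df$day), ]

# Within each id/time group, compare each state with the next day's state;
# the last day has no successor and gets NA
df$delta <- ave(df$state, df$id, df$time, FUN = function(s) {
  c(as.integer(s[-1] == s[-length(s)]), NA)
})

# Restore the original id/day/time order
df <- df[order(df$id, df$day, df$time), ]
df
```

The last day of each id has no following day to compare against, so its delta is NA rather than 0 or 1.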
I have an output for example as below:
ID C1 C2 C3 C4 C5 C6
1 0 1 2 2 1 1
2 0 1 1 2 1 1
3 1 0 1 1 1 1
4 2 0 2 2 1 2
5 2 1 1 0 2 2
6 1 2 1 0 1 2
7 2 2 2 2 0 2
8 1 1 1 1 0 1
9 1 1 2 2 2 0
10 1 2 1 2 1 0
I determine the co-occurrence of objects using the example from "faster way to compare rows in a data frame":
for (i in 1:(nr - 1)) {
  # all combinations of i with i+1 to nr
  samplematch <- cbind(dt[i], dt[(i + 1):nr])
  # renaming the comparison sample columns
  setnames(samplematch, append(colnames(dt), paste0(colnames(dt), "2")))
  # calculating number of matches
  samplematch[, noofmatches := 0]
  for (j in 1:nc) {
    samplematch[, noofmatches := noofmatches + 1 * (get(paste0("CC", j)) == get(paste0("CC", j, "2")))]
  }
  # removing individual value columns and matches < 5
  samplematch <- samplematch[noofmatches >= 5, list(ID, ID2, noofmatches)]
  # adding to the list
  totalmatches[[i]] <- samplematch
}
The result obtained with the code above helps me identify the total number of matches between each pair of IDs. However, I only want to count a match when the CC(1:6) values are 1 or 2, not 0. This means the total for each row should be 5 and not 6.
The output I need should contain information such as:
ID1 ID2 Match
1 2 4/5
1 3 2/5
1 4 3/5
: : :
: : :
2 3 3/5
2 4 2/5
How should the function be written without removing any rows, given that every row contains a 0?
In the code below, IDs is a data table of all pairs of distinct IDs. Then you need to check x <- df[c(ID1, ID2), -1], the non-ID columns of df corresponding to the given ID pair, for each row. The code creates a logical vector which is TRUE for non-zero columns (x[1] != 0) and columns with equal elements (x[2] == x[1]). The sum of this vector is then the number of matches.
library(data.table)
setDT(df)
setkey(df, ID)
IDs <- CJ(ID1 = df$ID, ID2 = df$ID)[ID1 != ID2]
IDs[, Match := {x <- df[c(ID1, ID2), -1]
sum(x[1] != 0 & x[2] == x[1])}
, by = .(ID1, ID2)]
head(IDs)
# ID1 ID2 Match
# 1: 1 2 4
# 2: 1 3 2
# 3: 1 4 3
# 4: 1 5 1
# 5: 1 6 1
# 6: 1 7 2
Data used:
df <- fread('
ID C1 C2 C3 C4 C5 C6
1 0 1 2 2 1 1
2 0 1 1 2 1 1
3 1 0 1 1 1 1
4 2 0 2 2 1 2
5 2 1 1 0 2 2
6 1 2 1 0 1 2
7 2 2 2 2 0 2
8 1 1 1 1 0 1
9 1 1 2 2 2 0
10 1 2 1 2 1 0
')
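The same rule (count the columns where the first row is non-zero and both rows agree) can also be sketched in base R without data.table. This is only an illustrative alternative, not the answer's method: it lists each unordered pair once via combn(), and the names m, pairs and match_count are made up for the example.

```r
# Example data from the question
df <- read.table(header = TRUE, text = "
ID C1 C2 C3 C4 C5 C6
1 0 1 2 2 1 1
2 0 1 1 2 1 1
3 1 0 1 1 1 1
4 2 0 2 2 1 2
5 2 1 1 0 2 2
6 1 2 1 0 1 2
7 2 2 2 2 0 2
8 1 1 1 1 0 1
9 1 1 2 2 2 0
10 1 2 1 2 1 0
")

m <- as.matrix(df[, -1])          # value columns only, one row per ID
pairs <- t(combn(nrow(m), 2))     # every unordered pair of row indices

# For each pair, count columns that are non-zero and equal in both rows
match_count <- vapply(seq_len(nrow(pairs)), function(k) {
  a <- m[pairs[k, 1], ]
  b <- m[pairs[k, 2], ]
  sum(a != 0 & a == b)
}, integer(1))

res <- data.frame(ID1 = df$ID[pairs[, 1]],
                  ID2 = df$ID[pairs[, 2]],
                  Match = match_count)
head(res, 3)
```

For the first three pairs (1-2, 1-3, 1-4) this gives 4, 2 and 3, in line with the output shown above.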
I work in the healthcare industry and I'm using machine learning algorithms to develop a model to predict when patients will not show up for their appointments. I'm trying to create a new feature that will be the sum of each patient's most recent consecutive no-shows. I've looked around a lot on stackoverflow and other resources, but cannot find exactly what I'm looking for. As an example, if a patient has no-showed her past two most recent appointments, then every row of the new feature's column with her ID will be filled in with 2's. If she no-showed three times, but showed up for her most recent appointment, then the new column will be filled in with 0's.
I tried using plyr's ddply with cumsum, but it did not give me the results I'm looking for. I used:
ddply(a, .(ID), transform, ConsecutiveNoshows = cumsum(Noshow))
Here is an example data set ('1' signifies a no-show):
ID Noshow
1 1
1 1
1 0
1 0
1 1
2 0
2 1
2 1
3 1
3 0
3 1
3 1
3 1
This is my desired outcome:
ID Noshow ConsecutiveNoshows
1 1 2
1 1 2
1 0 2
1 0 2
1 1 2
2 0 0
2 1 0
2 1 0
3 1 1
3 0 1
3 1 1
3 1 1
3 1 1
I'll be very grateful for any help. Thank you.
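The idea is to count, for each ID, the no-shows that occur before the first 0 appears. This relies on the rows of each ID being ordered with the most recent appointment first, as in the example data.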
library(dplyr)
df %>%
group_by(ID) %>%
mutate(ConsecutiveNoshows = sum(!cumsum(Noshow == 0) >= 1))
Which gives:
#Source: local data frame [13 x 3]
#Groups: ID [3]
#
# ID Noshow ConsecutiveNoshows
# <int> <int> <int>
#1 1 1 2
#2 1 1 2
#3 1 0 2
#4 1 0 2
#5 1 1 2
#6 2 0 0
#7 2 1 0
#8 2 1 0
#9 3 1 1
#10 3 0 1
#11 3 1 1
#12 3 1 1
#13 3 1 1
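A base R sketch of the same idea, for anyone who prefers to avoid dplyr. It assumes, like the example data, that each ID's rows are ordered most recent first, and that the data frame is called df. Within an ID, cumsum(x == 0) stays at 0 until the first attended appointment, so counting those positions gives the length of the leading no-show streak.

```r
# Example data from the question (1 = no-show, most recent appointment first)
df <- data.frame(
  ID     = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  Noshow = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1)
)

# Per ID: count the rows that come before the first 0;
# ave() recycles the single count across all rows of that ID
df$ConsecutiveNoshows <- ave(df$Noshow, df$ID, FUN = function(x) {
  sum(cumsum(x == 0) == 0)
})
df
```

If the data are stored oldest first instead, each ID's rows would need to be reversed (or sorted by appointment date, descending) before applying this.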
I am having an issue with the grep function. When I use it to select all the columns that start with a certain letter and only one column matches, the result loses its column name, and the code itself is used as the name instead:
> head(newdat1)
i1 b2 b1 b17
1 1 1 2 0
2 1 1 2 0
3 1 1 2 0
4 1 1 2 0
5 2 1 1 0
6 3 1 1 1
datformeanfill<-as.data.frame(newdat1[,grep("^i", colnames(newdat1))])
> head(datformeanfill)
newdat1[, grep("^i", colnames(newdat1))]
1 1
2 1
3 1
4 1
5 2
6 3
As opposed to if I have two or more columns that start with the same letter:
datnotformeanfill<-as.data.frame(newdat1[,grep("^b", colnames(newdat1))])
> head(datnotformeanfill)
b2 b1 b17
1 1 2 1
2 1 2 1
3 1 2 1
4 1 2 1
5 1 1 1
6 1 1 2
Where we see the column names are maintained, and it does the same if I have multiple "i". Please help thanks!
Use
datformeanfill <- newdat1[,grep("^i", colnames(newdat1)), drop=FALSE]
to ensure you always get back a data.frame. See ?'[.data.frame' for the details.
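A small sketch of the difference, using a data frame built from the first rows shown in the question (the names one_col and kept are made up for the example). Without drop=FALSE, a single selected column is simplified to a plain vector, which is why as.data.frame() then has no column name to work with.

```r
# First rows of the question's data
newdat1 <- data.frame(
  i1  = c(1, 1, 1, 1, 2, 3),
  b2  = c(1, 1, 1, 1, 1, 1),
  b1  = c(2, 2, 2, 2, 1, 1),
  b17 = c(0, 0, 0, 0, 0, 1)
)

cols <- grep("^i", colnames(newdat1))   # only "i1" matches

one_col <- newdat1[, cols]                # simplified to a numeric vector
kept    <- newdat1[, cols, drop = FALSE]  # stays a one-column data frame

is.data.frame(one_col)  # FALSE
is.data.frame(kept)     # TRUE
colnames(kept)          # "i1"
```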