I am taking my very first steps with SAS, and I ran into the following problem, which I am not able to solve.
Suppose my dataset is:
data dat;
input id score gender;
cards;
1 10 1
1 10 1
1 9 1
1 9 1
1 9 1
1 8 1
2 9 2
2 8 2
2 9 2
2 9 2
2 10 2
;
run;
What I need to do is count the number of times the score variable takes the values 8, 9 and 10 by id, and then create the new variables count8, count9 and count10 so that I get the following output:
id gender count8 count9 count10
1 1 1 3 2
2 2 1 3 1
How would you suggest I proceed? Any help would be greatly appreciated.
Lots of ways to do that. Here's a simple single data step approach (it assumes dat is sorted by id, as BY-group processing requires).
data want;
    set dat;
    by id;
    if first.id then do;
        count8=0;
        count9=0;
        count10=0;
    end;
    select(score);
        when(8) count8+1;
        when(9) count9+1;
        when(10) count10+1;
        otherwise;
    end;
    if last.id then output;
    keep id count8 count9 count10;
run;
SELECT...WHEN is basically shorthand for a chain of IF/ELSE IF statements (like CASE...WHEN in other languages).
Gender should be dropped, by the way, unless it's always the same by ID (or unless you intend to count by it.)
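For readers who don't know SAS, the same per-id accumulation can be sketched in Python. This is only an illustration of the logic, not a translation of the data step: the name count_scores and the dictionary output layout are my own assumptions, and gender is ignored just as in the KEEP statement above.

```python
from collections import Counter, defaultdict

# Rows from the question's dataset: (id, score, gender)
rows = [
    (1, 10, 1), (1, 10, 1), (1, 9, 1), (1, 9, 1), (1, 9, 1), (1, 8, 1),
    (2, 9, 2), (2, 8, 2), (2, 9, 2), (2, 9, 2), (2, 10, 2),
]

def count_scores(rows, wanted=(8, 9, 10)):
    """Return {id: {"count8": n, "count9": n, "count10": n}} (hypothetical helper)."""
    counts = defaultdict(Counter)
    for id_, score, _gender in rows:
        per_id = counts[id_]      # every id gets an entry, even if all its counts stay 0
        if score in wanted:       # plays the role of SELECT/WHEN; other scores are ignored
            per_id[score] += 1
    return {
        id_: {f"count{s}": c[s] for s in wanted}
        for id_, c in counts.items()
    }

result = count_scores(rows)
```

Unlike the data step, this does not need the rows to be sorted by id, because the counters are keyed by id rather than reset at first.id.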
A more flexible approach is to use PROC FREQ (or PROC MEANS, etc.) and transpose the result:
proc freq data=dat noprint;
tables id*score/out=want_pre;
run;
proc transpose data=want_pre out=want prefix=count;
by id;
id score;
var count;
run;
If you really only want 8, 9 and 10 and want to drop records with scores below 8, filter in the data= option of PROC FREQ:
proc freq data=dat(where=(score ge 8)) noprint;
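The two-step idea (tabulate id by score, then pivot the scores into columns) can also be sketched outside SAS. The Python below is an assumption-laden illustration: pivot is a hypothetical helper, and the score >= 8 filter stands in for the where= option.

```python
from collections import Counter

# Rows from the question's dataset: (id, score, gender)
rows = [
    (1, 10, 1), (1, 10, 1), (1, 9, 1), (1, 9, 1), (1, 9, 1), (1, 8, 1),
    (2, 9, 2), (2, 8, 2), (2, 9, 2), (2, 9, 2), (2, 10, 2),
]

# Step 1 (the PROC FREQ part): one count per (id, score) pair, keeping scores >= 8.
freq = Counter((id_, score) for id_, score, _gender in rows if score >= 8)

def pivot(freq, prefix="count"):
    """Step 2 (the PROC TRANSPOSE part): turn each score into a column (hypothetical helper)."""
    wide = {}
    for (id_, score), n in sorted(freq.items()):
        wide.setdefault(id_, {})[f"{prefix}{score}"] = n
    return wide

wide = pivot(freq)
```

Because the columns come from whatever scores occur in the data, nothing has to be hard-coded; a score that never occurs for an id is simply absent from that id's row.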
I'm new to R and I have the following problem. Maybe it's a really easy question, but I don't know the terms to search for an answer.
My problem:
I have several persons, and each person is assigned a study number (SN). Each SN has one or more tests performed, and each test can have multiple results.
My data is long at the moment, but I need it to be wide (one row for each SN).
For example:
What I have:
SN testnumbers result
1 1 1234 6
2 1 1234 9
3 2 4567 6
4 3 5678 9
5 3 8790 9
What I want:
SN test1result1 test1result2 test2result1
1 1 6 6 NA
2 2 6 NA NA
3 3 9 NA 9
So I think I need to renumber the testnumbers into test 1, test 2, etc. within each SN in order to use the spread function, but I don't know how.
I did manage to renumber testnumber into a sequence running from 1 to the number of unique testnumbers, but the wide data frame still looks awful.
I have a table like this:
a b c
-- -- --
1 1 10
2 1 0
3 1 0
4 4 20
5 4 0
6 4 0
The b column 'points' to a, a bit as if a were the parent.
c was computed. Now I need to propagate each parent's c value to its children.
The result would be
a b c
-- -- --
1 1 10
2 1 10
3 1 10
4 4 20
5 4 20
6 4 20
I can't make an UPDATE/SELECT combo that works.
So far I have a SELECT that produces the c column I'd like to get:
select t1.c from t t1 join t t2 on t1.a=t2.b;
c
----------
10
10
10
20
20
20
But I dunno how to stuff that into c
Thanx in advance
Cheers, phi
You have to look up the value with a correlated subquery:
UPDATE t
SET c = (SELECT c
         FROM t AS parent
         WHERE parent.a = t.b)
WHERE c = 0;
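To try this out without touching real data, here is a small sketch that runs the same UPDATE against an in-memory SQLite database from Python. The column types and the choice of SQLite are assumptions on my part; the table contents are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumed; the question only shows the values.
conn.execute("CREATE TABLE t (a INTEGER PRIMARY KEY, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO t (a, b, c) VALUES (?, ?, ?)",
    [(1, 1, 10), (2, 1, 0), (3, 1, 0), (4, 4, 20), (5, 4, 0), (6, 4, 0)],
)

# For each child row (c = 0), look up c on the row whose a equals this row's b.
conn.execute(
    """
    UPDATE t
    SET c = (SELECT c
             FROM t AS parent
             WHERE parent.a = t.b)
    WHERE c = 0
    """
)

updated = conn.execute("SELECT a, b, c FROM t ORDER BY a").fetchall()
conn.close()
```

The WHERE c = 0 clause means parent rows are never rewritten, so the subquery always reads an original parent value.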
I finally found a way to copy my initial 'temp' SELECT JOIN back into table 't'. Something like this:
create temp table u as select t1.c from t t1 join t t2 on t1.a=t2.b;
update t set c=(select * from u where rowid=t.rowid);
I'd like to know how the two solutions compare performance-wise: yours, with a single UPDATE and a correlated SELECT, and mine, with two statements, one of which also uses a correlated subquery. Mine seems heavier and less elegant, but I still wonder about the performance.
On the algorithm side, yours takes care to update only the child rows, while mine also copies each parent's value onto itself, which is a no-op but still consumes some cycles :)
Cheers, Phi
I have a data set which looks like this:
job_id start_hour duration
1 14 3
2 8 2
Job_id: the id of the job
start_hour: the hour at which the job starts
duration: the number of hours required for the job
I would like to turn it into a table where each line represents an hour for the job:
job_id hour
1 14
1 15
1 16
2 8
2 9
So for each job I would have as many lines as the job requires hours to be done.
Is there an elegant way to do this in R?
Many thanks
One way to do this is with the plyr package (where d is your original data frame):
library(plyr)
ddply(d, .(job_id),
      function(d) data.frame(job_id = d$job_id,
                             hour = d$start_hour:(d$start_hour + d$duration - 1)))
This is also possible with simple base functions. First, an input data.frame:
#sample data
dd<-data.frame(
job_id = 1:2,
start_hour = c(14, 8),
duration = c(3, 2)
)
Now we use Map to walk through each row and expand it to the right size. Then we combine all the newly expanded rows into one data.frame with do.call(rbind,...)
#transformation
do.call(rbind, Map(function(id, start, dur) {
    data.frame(
        job_id = rep(id, dur),
        hour = seq(from = start, by = 1, length.out = dur))
}, dd$job_id, dd$start_hour, dd$duration))
which gives us
job_id hour
1 1 14
2 1 15
3 1 16
4 2 8
5 2 9
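The row-expansion idea is not specific to R. As a rough sketch in Python (expand_hours is a hypothetical name, and the list-of-dicts input is just one assumed way to hold the table), each job is expanded into one (job_id, hour) pair per hour of its duration:

```python
# The two jobs from the question.
jobs = [
    {"job_id": 1, "start_hour": 14, "duration": 3},
    {"job_id": 2, "start_hour": 8, "duration": 2},
]

def expand_hours(jobs):
    """Return one (job_id, hour) pair per hour each job runs (hypothetical helper)."""
    return [
        (job["job_id"], hour)
        for job in jobs
        # range's end is exclusive, so start + duration yields exactly `duration` hours
        for hour in range(job["start_hour"], job["start_hour"] + job["duration"])
    ]

expanded = expand_hours(jobs)
```

Note that this sketch, like the answers above, does not wrap hours past midnight.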
I have one file with repeated measures data and another file with single observations for the same persons (e.g. in one file subjects have repeated assessments, and the other file just says whether subjects are male or female). When I merge the files I get something like this:
ID time gender
1 1 0
1 2
1 3
2 1 1
2 2
3 1 0
3 2
3 3
3 4
but I want the variable that was measured once (e.g. male/female) to be repeated across time (in each row) for each subject. So I would like to have:
1 1 0
1 2 0
1 3 0
2 1 1
2 2 1
and not do it manually, since I have thousands of cases...
How can I do this in SPSS (preferably) or in R?
You should have used MATCH FILES with one "file" (multiple records per ID) and one "table" (no duplicate IDs).
But you can probably still fix it by running
sort cases by ID.
if missing(gender) and ID = lag(ID) gender = lag(gender).
Wherever there's no value for gender, it will be filled in with the gender of the previous case if it has the same ID as the current one.
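The same "carry the previous case forward within an ID" logic can be sketched in Python to make the mechanics explicit. This is only an illustration under my own assumptions: fill_gender is a hypothetical name, and missing values are represented as None.

```python
# The merged data from the question: (ID, time, gender); None marks a missing gender.
records = [
    (1, 1, 0), (1, 2, None), (1, 3, None),
    (2, 1, 1), (2, 2, None),
    (3, 1, 0), (3, 2, None), (3, 3, None), (3, 4, None),
]

def fill_gender(records):
    """Fill a missing gender from the previous row when it has the same ID (hypothetical helper)."""
    filled = []
    prev_id, prev_gender = None, None
    # sorted() is stable, so rows keep their original order within each ID
    for id_, time, gender in sorted(records, key=lambda r: r[0]):
        if gender is None and id_ == prev_id:
            gender = prev_gender  # the previous row was already filled, so values cascade
        filled.append((id_, time, gender))
        prev_id, prev_gender = id_, gender
    return filled

filled = fill_gender(records)
```

As in the SPSS version, a value is only carried forward within one ID, so an ID whose first row has no gender stays missing.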
I want to have a loop that will perform a calculation for me, and export the variable (along with identifying information) into a new data frame.
My data look like this:
Each unique sampling point (UNIQUE) has 4 data points associated with it (they differ by WAVE).
WAVE REFLECT REFEREN PLOT LOCAT COMCOMP DATE UNIQUE
1 679.9 119 0 1 1 1 11.16.12 1
2 799.9 119 0 1 1 1 11.16.12 1
3 899.8 117 0 1 1 1 11.16.12 1
4 970.3 113 0 1 1 1 11.16.12 1
5 679.9 914 31504 1 2 1 11.16.12 2
6 799.9 1693 25194 1 2 1 11.16.12 2
And I want to create a new data frame that will look like this:
For each unique sampling point, I want to calculate "WBI" from 2 specific "WAVE" measurements.
WBI PLOT .... UNIQUE
(WAVE==899.8/WAVE==970) 1 1
(WAVE==899.8/WAVE==970) 1 2
(WAVE==899.8/WAVE==970) 1 3
Depending on the size of your input data.frame, there could be a better solution in terms of efficiency, but the following should work fine for small or medium data sets and is fairly simple:
out.unique = unique(input$UNIQUE);
out.plot = sapply(out.unique, simplify=TRUE, function(uq) {
    # assuming that plot is simply the first PLOT of those belonging to that
    # unique number. If not, you should change this.
    subset(input, subset = UNIQUE == uq)$PLOT[1];
});
out.wbi = sapply(out.unique, simplify=TRUE, function(uq) {
    # not sure how you compose WBI, but I assume it uses the last two
    # records with that unique number, so it matches the first output row of your example
    uq.subset = subset(input, subset = UNIQUE == uq);
    uq.nrow = nrow(uq.subset);
    paste("(WAVE=", uq.subset$WAVE[uq.nrow-1], "/WAVE=", uq.subset$WAVE[uq.nrow], ")", sep="")
});
output = data.frame(WBI=out.wbi, PLOT=out.plot, UNIQUE=out.unique);
If the input data is big, however, you may want to exploit the fact that records seem to be sorted by "UNIQUE", since repeated data.frame subsetting would be costly. Both sapply calls could also be combined into one, but that would make the code a bit more cumbersome, so I left it like this.
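To illustrate the single-pass idea, here is a rough sketch in Python rather than R. It assumes the records really are sorted by UNIQUE and that every group has at least two rows; summarise is a hypothetical name, only the columns needed here are kept, and, like the R code above, it only builds the WBI label from the last two WAVE values rather than computing a ratio.

```python
from itertools import groupby
from operator import itemgetter

# The six rows shown in the question (only the columns used here).
records = [
    {"WAVE": 679.9, "PLOT": 1, "UNIQUE": 1},
    {"WAVE": 799.9, "PLOT": 1, "UNIQUE": 1},
    {"WAVE": 899.8, "PLOT": 1, "UNIQUE": 1},
    {"WAVE": 970.3, "PLOT": 1, "UNIQUE": 1},
    {"WAVE": 679.9, "PLOT": 1, "UNIQUE": 2},
    {"WAVE": 799.9, "PLOT": 1, "UNIQUE": 2},
]

def summarise(records):
    """One pass over records sorted by UNIQUE; one output row per group (hypothetical helper)."""
    out = []
    # groupby only merges adjacent rows, which is why the input must be sorted by UNIQUE
    for uq, group in groupby(records, key=itemgetter("UNIQUE")):
        rows = list(group)
        label = f"(WAVE={rows[-2]['WAVE']}/WAVE={rows[-1]['WAVE']})"  # last two rows of the group
        out.append({"WBI": label, "PLOT": rows[0]["PLOT"], "UNIQUE": uq})
    return out

summary = summarise(records)
```

Each group is visited exactly once, so both the PLOT lookup and the WBI label come out of the same loop instead of two separate subsetting passes.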