In R, can I concatenate, then call variable column=concatenated string?

I am trying to identify the time of primary ambulance arrival for a number of patients in my dataframe=data.
The primary ambulance is either the 1st, 2nd, 3rd or 4th vehicle on scene (data$prim.amb.num=1, 2, 3, or 4 for each patient/row).
data$time_v1, data$time_v2, data$time_v3 and data$time_v4 have a time or a missing value, which corresponds to the 1st, 2nd, 3rd and 4th vehicles, where relevant.
What I would like to do is make a new variable=prim.amb.time with the time that corresponds to primary ambulance arrival time. Suppose for patient=1, the ambulance was the first. Then I want data[1,"prim.amb.time"]=data[1,"time_v1"].
I can figure out the correct time_v* with the following:
paste("time_v", data$prim.amb.num, sep="")
But I'm stuck as to how to pass the resulting information to call the correct column.
My hope was to simply have something like:
data$prim.amb.time<-data$paste("time_v", data$prim.amb.num, sep="")
but of course, this doesn't work. I'm not even sure how to Google for this; I tried various combinations of this title but to no avail. Any suggestions?

Although I liked the answer by @mhermans, if you want a one-liner, one solution is to use ?apply as follows:
# From @mhermans
zz <- textConnection("patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60
")
d <- read.table(zz, header = TRUE)
close(zz)
#Take each row of d and pull out time_vn where n = d$prime.amb.num
d$prime.amb.time <- apply(d, 1, function(x) {x[x['prime.amb.num'] + 2]})
> d
patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4 prime.amb.time
1 1000 1 30 40 60 100 30
2 1001 3 40 50 60 80 60
3 1002 2 10 30 40 45 30
4 1003 1 24 40 45 60 24
EDIT - or with paste:
d$prime.amb.time <-
apply(
d,
1,
function(x) {
x[paste('time_v', x['prime.amb.num'], sep = '')]
}
)
#Gives the same result
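A vectorised alternative, sketched under the assumption that the arrival columns are named time_v1 to time_v4 as in the example data: pick one cell per row by indexing a matrix of the time columns with a two-column (row, column) matrix. The names time_cols and col_idx are introduced here for illustration only.

```r
d <- read.table(text = "patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60", header = TRUE)

# Names of the arrival-time columns (assumed naming scheme: time_v1 ... time_v4)
time_cols <- paste0("time_v", 1:4)

# For every row, build the name of the column to read and look up its position
col_idx <- match(paste0("time_v", d$prime.amb.num), time_cols)

# Matrix indexing: take row i, column col_idx[i]
d$prime.amb.time <- as.matrix(d[time_cols])[cbind(seq_len(nrow(d)), col_idx)]
d$prime.amb.time
# 30 60 30 24
```

Because only the time columns are turned into a matrix, this should keep working if d also holds character columns, a case in which apply() over whole rows would coerce every value to character.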

Set up example data:
# read in basic example data for four patients, wide format
zz <- textConnection("patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60
")
d <- read.table(zz, header = TRUE)
close(zz)
In the example dataset I'm thus assuming your data looks like this:
patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1 1000 1 30 40 60 100
2 1001 3 40 50 60 80
3 1002 2 10 30 40 45
4 1003 1 24 40 45 60
Given that data structure, it is perhaps easier to work with a dataset with a vehicle per row, instead of a patient per row. This can be accomplished by using reshape() to convert from a wide to a long format.
dl <- reshape(d, direction='long', idvar="patient.id", varying=list(3:6))
# ordering & rename var for aesth. reasons:
dl <- dl[order(dl$patient.id, dl$time),]
dl$vehicle.id <- dl$time
dl$time <- NULL
dl
This gives a long dataset, with a row per vehicle:
patient.id prime.amb.num time_v1 vehicle.id
1000.1 1000 1 30 1
1000.2 1000 1 40 2
1000.3 1000 1 60 3
1000.4 1000 1 100 4
1001.1 1001 3 40 1
1001.2 1001 3 50 2
1001.3 1001 3 60 3
1001.4 1001 3 80 4
1002.1 1002 2 10 1
1002.2 1002 2 30 2
1002.3 1002 2 40 3
1002.4 1002 2 45 4
1003.1 1003 1 24 1
1003.2 1003 1 40 2
1003.3 1003 1 45 3
1003.4 1003 1 60 4
Getting the arrival time of the primary ambulance per patient then becomes a simple one-liner:
dl[dl$prime.amb.num == dl$vehicle.id,]
which gives
patient.id prime.amb.num time_v1 vehicle.id
1000.1 1000 1 30 1
1001.3 1001 3 60 3
1002.2 1002 2 30 2
1003.1 1003 1 24 1
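To carry the selected time back to the one-row-per-patient table, one option is match() on the patient identifier. This is a sketch rather than part of the original answer; v.names and timevar are set so that the long columns get the illustrative names arrival and vehicle.id directly, and prim is a hypothetical helper name.

```r
d <- read.table(text = "patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60", header = TRUE)

# Wide to long: one row per patient and vehicle
dl <- reshape(d, direction = "long", idvar = "patient.id",
              varying = list(paste0("time_v", 1:4)),
              v.names = "arrival", timevar = "vehicle.id")

# Keep only the rows that describe the primary ambulance
prim <- dl[dl$prime.amb.num == dl$vehicle.id, ]

# Look up each patient's primary arrival time in the filtered long table
d$prime.amb.time <- prim$arrival[match(d$patient.id, prim$patient.id)]
d$prime.amb.time
# 30 60 30 24
```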

Related

Replace value with the mean based on two classes

I have a dataset with 2 calendar variables (Week & Hour) and 1 Amount variable:
Week Hour Amount
35 1 367
35 2 912
36 1 813
36 2 482
37 1 112
37 2 155
35 1 182
35 2 912
36 1 551
36 2 928
37 1 125
37 2 676
I wish to replace each value of Amount with the mean of all observations that share the same Week/Hour pair. For instance, here there are 2 obs. for (Week=35, Hour=1), with Amount values of 367 and 182. Hence, for this example, the 2 rows with (Week=35, Hour=1) should have the Amount replaced with mean(c(367,182)). The final output should be:
Week Hour Amount
35 1 274.5
35 2 912.0
36 1 682.0
36 2 705.0
37 1 118.5
37 2 415.5
35 1 274.5
35 2 912.0
36 1 682.0
36 2 705.0
37 1 118.5
37 2 415.5
I have the following code that solves this issue. However, for the complete dataset with thousands of rows, it is very slow. Is there any way to automatically reshape with these paired means?
dataset = data.frame(Week=c(35,35,36,36,37,37,35,35,36,36,37,37),
Hour = c(1,2,1,2,1,2,1,2,1,2,1,2),
Amount = c(367,912,813,482,112,155,182,912,551,928,125,676))
means <- reshape2::dcast(dataset, Week~Hour, value.var="Amount", mean)
for (i in 1:nrow(dataset)) {
print(i)
dataset$Amount[i] <- means[means$Week==dataset$Week[i],which(colnames(means)==dataset$Hour[i])]
}
Possible solution with dplyr:
library(dplyr)
dataset %>%
group_by(Week, Hour) %>%
summarise(mean_amount = mean(Amount))
You group by Week and Hour and calculate the mean based on this condition.
EDIT
To keep the original structure (number of rows) alter the code to
dataset %>%
group_by(Week, Hour) %>%
mutate(Amount = mean(Amount))
If the idea is just to get the mean Amount by Week and Hour, this would work:
aggregate(Amount ~ ., dataset, mean)
Week Hour Amount
1 35 1 274.5
2 36 1 682.0
3 37 1 118.5
4 35 2 912.0
5 36 2 705.0
6 37 2 415.5
EDIT:
If, however, the idea is to put the averages back into the dataset, then this should work:
x <- aggregate(Amount ~ ., dataset, mean)
dataset$Amount <- x$Amount[match(apply(dataset[,1:2], 1, paste0, collapse = " "),
apply(x[,1:2], 1, paste0, collapse = " "))]
dataset
Week Hour Amount
1 35 1 274.5
2 35 2 912.0
3 36 1 682.0
4 36 2 705.0
5 37 1 118.5
6 37 2 415.5
7 35 1 274.5
8 35 2 912.0
9 36 1 682.0
10 36 2 705.0
11 37 1 118.5
12 37 2 415.5
Explanation:
This uses apply to paste the first two columns of each row into a single string, both in the means dataframe x and in dataset. It then uses match on these strings to assign each mean to the corresponding rows in dataset.
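The same key-and-match idea can be sketched without apply, since paste is already vectorised over columns. The names key_data and key_means are illustrative, and the grouping columns are spelled out as Week + Hour instead of the dot shorthand.

```r
dataset <- data.frame(Week = c(35,35,36,36,37,37,35,35,36,36,37,37),
                      Hour = c(1,2,1,2,1,2,1,2,1,2,1,2),
                      Amount = c(367,912,813,482,112,155,182,912,551,928,125,676))

# One row of means per Week/Hour pair
x <- aggregate(Amount ~ Week + Hour, dataset, mean)

# Build a string key per row in both tables
key_data  <- paste(dataset$Week, dataset$Hour)
key_means <- paste(x$Week, x$Hour)

# match() returns, for each dataset row, the position of its key in x
dataset$Amount <- x$Amount[match(key_data, key_means)]
dataset$Amount
# 274.5 912.0 682.0 705.0 118.5 415.5 274.5 912.0 682.0 705.0 118.5 415.5
```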
EDIT 2:
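Alternatively, you can use interaction together with match for this transformation:
dataset$Amount <- x$Amount[match(interaction(dataset[,1:2]), interaction(x[,1:2]))]
# Note: %in% cannot stand in for match() here. It returns a logical mask over the
# rows of x, not the position of each dataset row in x, so the means would be
# recycled in the row order of x instead of being aligned with dataset.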
Base R solution:
dataset$Amount <- with(dataset, ave(Amount, Week, Hour, FUN = mean))
Data:
dataset = data.frame(Week=c(35,35,36,36,37,37,35,35,36,36,37,37),
Hour = c(1,2,1,2,1,2,1,2,1,2,1,2),
Amount = c(367,912,813,482,112,155,182,912,551,928,125,676))

Rank system in R, recursive function

I really have no idea what I'm looking for: a loop, a recursive function, or maybe something different.
This is my toy dataset:
ID1 S1 S2 S3
1 10 20 30
2 20 30 40
1 50 60 70
3 20 40 50
1 10 30 10
2 40 20 20
toy$OLD_RANK = find the previous row with the same ID1 and copy the NEW_RANK of that row. If there is no earlier row with the same ID1, use an assigned starting value (10 in this example)
toy$NEW_RANK = OLD_RANK + S1+S2+S3
expected result:
ID1 S1 S2 S3 OLD_RANK NEW_RANK
1 10 20 30 10 70
2 20 30 40 10 100
1 50 60 70 70 250
3 20 40 50 10 120
1 10 30 10 250 300
2 40 20 20 100 180
dataframe for R as requested:
toy <- matrix(c(1,10,20,30,2,20,30,40,1,50,60,70,3,20,40,50,1,10,30,10,2,40,20,20),ncol=4,byrow=TRUE)
colnames(toy) <- c("ID1","S1","S2","S3")
toy <- as.data.frame(toy)
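One way to read the rule above: since each NEW_RANK feeds the next OLD_RANK for the same ID1, NEW_RANK is simply the starting value plus the running total of S1 + S2 + S3 within that ID1, so neither a loop nor recursion is needed. The sketch below follows the stated rule literally; start_rank, score and new_rank are illustrative names, and the starting value of 10 is taken from the question.

```r
toy <- data.frame(ID1 = c(1, 2, 1, 3, 1, 2),
                  S1  = c(10, 20, 50, 20, 10, 40),
                  S2  = c(20, 30, 60, 40, 30, 20),
                  S3  = c(30, 40, 70, 50, 10, 20))

start_rank <- 10                      # assigned value for a first appearance
score <- with(toy, S1 + S2 + S3)      # points added by each row

# Running total of the scores within each ID1, in row order
new_rank <- start_rank + ave(score, toy$ID1, FUN = cumsum)

toy$OLD_RANK <- new_rank - score      # rank before this row's points
toy$NEW_RANK <- new_rank
toy$NEW_RANK
# 70 100 250 120 300 180
```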

Replicating an Excel SUMIFS formula

I need to replicate - or at least find an alternative solution - for a SUMIFS function I have in Excel.
I have a transactional database:
SegNbr Index Revenue SUMIF
A 1 10 30
A 1 20 30
A 2 30 100
A 2 40 100
B 1 50 110
B 1 60 110
B 3 70 260
B 3 80 260
and I need to create another column that sums the Revenue, by SegmentNumber, for all indexes that are less than or equal to the Index in that row. It is a distorted rolling revenue, as it will be the same for each SegmentNumber/Index key. The formula is this one:
=SUMIFS([Revenue],[SegNbr],[@SegNbr],[Index],"<="&[@Index])
Let's say you have this sample data.frame
dd<-read.table(text="SegNbr Index Revenue
A 1 10
A 1 20
A 2 30
A 2 40
B 1 50
B 1 60
B 3 70
B 3 80", header=T)
Now if we make sure the data is ordered by segment and index, we can do
dd <- dd[order(dd$SegNbr, dd$Index), ] # sort data
dd$OUT <- with(dd,
ave(
ave(Revenue, SegNbr, FUN = cumsum), # running sum per segment
interaction(SegNbr, Index, drop = TRUE),
FUN = max) # largest running sum per index per segment
)
dd
This gives
SegNbr Index Revenue OUT
1 A 1 10 30
2 A 1 20 30
3 A 2 30 100
4 A 2 40 100
5 B 1 50 110
6 B 1 60 110
7 B 3 70 260
8 B 3 80 260
as desired.
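For comparison, a more literal translation of the SUMIFS conditions can be sketched as a row-by-row lookup. It does not need the data to be sorted, but it compares every row against every other row, so it is likely to be slower than the ave approach on large tables. The column name SUMIF is taken from the question's table.

```r
dd <- read.table(text = "SegNbr Index Revenue
A 1 10
A 1 20
A 2 30
A 2 40
B 1 50
B 1 60
B 3 70
B 3 80", header = TRUE)

# For each row i: sum Revenue over rows in the same segment whose Index
# is less than or equal to the Index of row i
dd$SUMIF <- vapply(seq_len(nrow(dd)), function(i) {
  same_seg  <- dd$SegNbr == dd$SegNbr[i]
  up_to_idx <- dd$Index <= dd$Index[i]
  as.numeric(sum(dd$Revenue[same_seg & up_to_idx]))
}, numeric(1))
dd$SUMIF
# 30 30 100 100 110 110 260 260
```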

Average deviation from several columns based on a single column in a dataframe in R

This is my first post on here and I am pretty new to R.
I have a huge datafile that looks like the example below.
name = factor(c("A","B","C","D","E","F","G","H","H"))
school = c(1,1,1,2,2,2,3,3,3)
age = c(10,20,0,30,40,50,60,NA,70)
mark = c(100,70,100,50,90,100,NA,50,50)
data = data.frame(name=name,school=school,age=age,mark=mark)
name school age mark (many other trait columns)
A 1 10 100
B 1 20 70
C 1 NA 100
D 2 30 50
E 2 40 90
F 2 50 100
G 3 60 NA
H 3 NA 50
H 3 70 50
What I need to do is calculate the average of many traits per school, and for each trait I want to create two other columns: one with the mean per school for the trait and another one with the deviation of each value from that mean. I also have trait values of "zero" and "NA", which I don't want to include in the mean calculation. The file I need would look like this:
name school age agemean agedev mark markmean markdev (continue for other traits)
A 1 10 15 -5 100 90 10
B 1 20 15 5 70 90 -20
C 1 0 15 0 100 90 10
D 2 30 40 -10 50 80 -30
E 2 40 40 0 90 80 10
F 2 50 40 10 100 80 20
G 3 60 65 -5 NA 50 0
H 3 NA 65 0 50 50 0
H 3 70 65 5 50 50 0
I did a search on here and found some similar questions, but I didn't get how to apply them to my case. I tried to use the aggregate function, but it is not working. Any help would be very much appreciated.
Cheers.
Sounds like a good job for dplyr. Here's how you could do it if you want to keep all existing rows per school:
require(dplyr)
data %>%
group_by(school) %>%
mutate_each(funs(mean(., na.rm = TRUE), sd(., na.rm = TRUE)), -name)
#Source: local data frame [9 x 8]
#Groups: school
#
# name school age mark age_mean mark_mean age_sd mark_sd
#1 A 1 10 100 15 90 7.071068 17.32051
#2 B 1 20 70 15 90 7.071068 17.32051
#3 C 1 NA 100 15 90 7.071068 17.32051
#4 D 2 30 50 40 80 10.000000 26.45751
#5 E 2 40 90 40 80 10.000000 26.45751
#6 F 2 50 100 40 80 10.000000 26.45751
#7 G 3 60 NA 65 50 7.071068 0.00000
#8 H 3 NA 50 65 50 7.071068 0.00000
#9 H 3 70 50 65 50 7.071068 0.00000
If you want to reduce each school to a single-row-summary, you can replace mutate_each with summarise_each in the code above.
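The output above reports a standard deviation per school, whereas the question asks for each row's deviation from its school mean, with zeros and NA left out of the mean. A base R sketch of that reading follows; the names traits and valid are illustrative, the deviation of an excluded value is set to 0 to mirror the table in the question, and the data use the vectors exactly as given there (including the 0 age for C).

```r
data <- data.frame(name   = factor(c("A","B","C","D","E","F","G","H","H")),
                   school = c(1,1,1,2,2,2,3,3,3),
                   age    = c(10,20,0,30,40,50,60,NA,70),
                   mark   = c(100,70,100,50,90,100,NA,50,50))

traits <- c("age", "mark")   # extend with the other trait columns

for (tr in traits) {
  x <- data[[tr]]
  valid <- !is.na(x) & x != 0                      # values that count toward the mean
  m <- ave(replace(x, !valid, NA), data$school,
           FUN = function(v) mean(v, na.rm = TRUE))  # school mean of valid values
  data[[paste0(tr, "mean")]] <- m
  data[[paste0(tr, "dev")]]  <- ifelse(valid, x - m, 0)
}
data$agedev
# -5 5 0 -10 0 10 -5 0 5
```

The new column names are built with paste0, so the loop should scale to many traits without listing each output column by hand.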

count a variable by using a condition in R

I am new to R. I have a data frame containing 3 columns.
The first one shows the ID; each household has a unique ID. The second column shows the relationship (1 for father, 2 for mother and 3 for children). The third column shows their age.
Now I want to know how many twins there are in each family (twins are children that have the same age within a family).
My data frame:
Id relationship age
1001 1 60
1001 2 50
1001 3 20
1002 1 70
1002 2 68
1002 3 23
1002 3 27
1002 3 27
1002 3 23
1003 1 60
1003 2 40
1003 3 20
1003 3 20
result:
id twins
1001 0
1002 2
1003 1
Here's a base R solution using aggregate
> aggregate(age ~ Id, function(x) sum(duplicated(x)), data=df[df[,2]==3, ])
Id age
1 1001 0
2 1002 2
3 1003 1
It's a little difficult to attempt these without a working example. You can use dput() to create one. ... but I think this should work.
library(plyr)
df <- df[df$relationship==3,]
ddply(df, .(Id,age), nrow)
That gives the number of children per age rather than just the twins, so subtract one from each count and sum per household:
almost <- ddply(df[df$relationship==3,], .(Id,age), function(x) nrow(x)-1)
aggregate(almost$V1, list(almost$Id), FUN =sum )
# Group.1 x
#1 1001 0
#2 1002 2
#3 1003 1
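Since the thread lacks a reproducible example, here is a self-contained sketch that rebuilds the data frame from the question and counts, per household, the children whose age repeats an earlier child's age. It uses tapply rather than aggregate or plyr; children, twins and result are illustrative names. Note that under this counting three children of the same age would contribute 2.

```r
df <- data.frame(
  Id           = c(1001, 1001, 1001, 1002, 1002, 1002, 1002, 1002, 1002,
                   1003, 1003, 1003, 1003),
  relationship = c(1, 2, 3, 1, 2, 3, 3, 3, 3, 1, 2, 3, 3),
  age          = c(60, 50, 20, 70, 68, 23, 27, 27, 23, 60, 40, 20, 20)
)

# Keep only the children
children <- df[df$relationship == 3, ]

# duplicated() flags every age already seen within the same household
twins <- tapply(children$age, children$Id, function(a) sum(duplicated(a)))

result <- data.frame(id = as.numeric(names(twins)), twins = as.vector(twins))
result$twins
# 0 2 1
```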
