Loop in a column with conditions only once - R

library(xts)

data <- c(100,101,102,103,104,99,98,97,94,93,103,90,104,105,110)
date <- Sys.Date() - 15:1
file <- xts(data, date)
colnames(file) <- "CLOSE"
file$high <- cummax(file$CLOSE)                  # running maximum of CLOSE
file$trade <- 0
file$trade[file$high*.95 >= file$CLOSE] <- 1     # 5% or deeper drawdown
file$trade[file$high*.90 >= file$CLOSE] <- 2     # 10% or deeper drawdown
file$trade[file$high*.85 >= file$CLOSE] <- 3     # 15% or deeper drawdown
file
CLOSE high trade
2013-07-06 100 100 0
2013-07-07 101 101 0
2013-07-08 102 102 0
2013-07-09 103 103 0
2013-07-10 104 104 0
2013-07-11 99 104 0
2013-07-12 98 104 1
2013-07-13 97 104 1
2013-07-14 94 104 1
2013-07-15 93 104 2
2013-07-16 103 104 0
2013-07-17 90 104 2
2013-07-18 104 104 0
2013-07-19 105 105 0
2013-07-20 110 110 0
I need to modify the trade column so that after the first "1" appears, every subsequent element is zero until a "2" appears, then zero again until a "3" appears, and so on; only the first occurrence of each non-zero level should survive.

I think you could simply do:
> file$trade[duplicated(file$trade)] <- 0
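On the sample data above this keeps only the first occurrence of each non-zero level, which is the expected result:
as.numeric(file$trade)
# [1] 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0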

You don't need a loop to do this. Indeed, you simply need to find the positions of the first "1", "2", and so on. Try the following code.
rank.trade <- rank(file$trade, ties.method = "first")   # rank values, ties broken by position
marks <- cumsum(head(table(file$trade), -1)) + 1        # ranks of the first "1", first "2", ...
black.list <- is.na(match(rank.trade, marks))           # every row that is not such a first hit
file$trade[black.list] <- 0


Cannot read .data file under R?

Good morning,
I need to read the following .data file: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data
For this, I tried the following without success:
f <- file("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data", open="r", encoding="UTF-16LE")
data <- read.table(f, dec=",", header=F)
Thanks a lot for your help!
I would try to use the coatless/ucidata package to access the data.
https://github.com/coatless/ucidata
Here you can see how the package loads in and processes the data file:
https://github.com/coatless/ucidata/blob/master/data-raw/heart_disease_build.R
If you wish to try out the package, you will need devtools installed. Here is what you can try:
# install.packages("devtools")
devtools::install_github("coatless/ucidata")
# load data
data("heart_disease_cl", package = "ucidata")
# show beginning rows of data
head(heart_disease_cl)
Output
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num
1 63 Male typical angina 145 233 1 probable/definite hypertrophy 150 No 2.3 downsloping 0 fixed defect 0
2 67 Male asymptomatic 160 286 0 probable/definite hypertrophy 108 Yes 1.5 flat 3 normal 2
3 67 Male asymptomatic 120 229 0 probable/definite hypertrophy 129 Yes 2.6 flat 2 reversable defect 1
4 37 Male non-anginal pain 130 250 0 normal 187 No 3.5 downsloping 0 normal 0
5 41 Female atypical angina 130 204 0 probable/definite hypertrophy 172 No 1.4 upsloping 0 normal 0
6 56 Male atypical angina 120 236 0 normal 178 No 0.8 upsloping 0 normal 0
I found another solution with RCurl (applied here to a different UCI dataset, the heart-failure clinical records):
library(RCurl)
download <- getURL("http://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv")
data <- read.csv(text = download)
head(data)
# Output:
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine
1 75 0 582 0 20 1 265000 1.9
2 55 0 7861 0 38 0 263358 1.1
3 65 0 146 0 20 0 162000 1.3
4 50 1 111 0 20 0 210000 1.9
5 65 1 160 1 20 0 327000 2.7
6 90 1 47 0 40 1 204000 2.1
serum_sodium sex smoking time DEATH_EVENT
1 130 1 0 4 1
2 136 1 0 6 1
3 129 1 1 7 1
4 137 1 0 7 1
5 116 0 0 8 1
6 132 1 1 8 1
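Two hedged asides. First, recent R builds can read an http/https URL directly, so RCurl is not strictly required:
data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv")
Second, the original cleveland.data is the raw dump with each record spread over several lines, which is why read.table fails on it; the processed version of the same dataset sits in the same UCI directory and parses cleanly (assuming it is still hosted at this path; "?" marks missing values):
u <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
cleveland <- read.csv(u, header = FALSE, na.strings = "?")
head(cleveland)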

Assign groups with preassigned conditions in R

I have a matrix (df1) like this:
vid col1
103 9
103 3
103 7
103 6
104 7
104 8
104 9
105 6
105 8
106 8
106 9
106 4
106 6
I have another matrix (df2):
vid col1
103 0
104 1
105 5
106 3
I want to assign groups to df1 so that each row's group ID comes from df2, matched on vid.
Namely, I would like the following manipulation on df1; desired output:
vid col1 col2
103 9 0
103 3 0
103 7 0
103 6 0
104 7 1
104 8 1
104 9 1
105 6 5
105 8 5
106 8 3
106 9 3
106 4 3
106 6 3
I was trying the following:
df1 <- cbind(df1, 0)
for (i in 1:nrow(df1)) {
  for (j in 1:nrow(df2)) {
    if (df1[i,1] == df2[j,1]) {
      df1[i,3] <- df2[j,2]
    } else {
      df1[i,3] <- NA
    }
  }
}
But this doesn't seem to work. Could anyone help me with this, please? Thanks!
You can use merge to join the two data frames together:
merge(df1, df2, by='vid')
Note that because both data frames have a column named col1, merge will suffix them as col1.x and col1.y; rename the second to col2 to match the desired output. The output would then be
vid col1 col2
103 9 0
103 3 0
103 7 0
103 6 0
104 7 1
104 8 1
104 9 1
105 6 5
105 8 5
106 8 3
106 9 3
106 4 3
106 6 3
First, create a vector; it will become the third column of your final data frame.
Vec <- vector()
I am assuming here that you have only 4 IDs, which are:
df2$vid
[1] 103 104 105 106
Now fill the third column while traversing all rows of the first data frame, df1.
for (i in 1:nrow(df1)) {
  if (df1[i,1] == 103) { Vec[i] <- df2[df2$vid == 103, 2] }
  if (df1[i,1] == 104) { Vec[i] <- df2[df2$vid == 104, 2] }
  if (df1[i,1] == 105) { Vec[i] <- df2[df2$vid == 105, 2] }
  if (df1[i,1] == 106) { Vec[i] <- df2[df2$vid == 106, 2] }
}
Finally, combine the third column with data frame df1.
df1 <- cbind(df1, Vec)
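For what it's worth, a vectorized base R alternative avoids both the loop and the hard-coded IDs; a minimal sketch, assuming every vid in df1 also appears in df2:
df1$col2 <- df2$col1[match(df1$vid, df2$vid)]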

Counting Instances of Multiple Variables in R

I have a large data table Divvy (over 2.4 million records) that appears as such (some columns removed):
X trip_id from_station_id.x to_station_id.x
1 1109420 94 69
2 1109421 69 216
3 1109427 240 245
4 1109431 113 94
5 1109433 127 332
3 1109429 240 245
I would like to find the number of trips from each station to each opposing station. So for example,
From X To Y Sum
94 69 1
240 245 2
etc., and then join it back to the initial table using dplyr to make something like the below, and then limit it to distinct from/to station combos, which I'll use to map routes (I have lat/long for each station):
X trip_id from_station_id.x to_station_id.x Sum
1 1109420 94 69 1
2 1109421 69 216 1
3 1109427 240 245 2
4 1109431 113 94 1
5 1109433 127 332 1
3 1109429 240 245 1
I successfully used count to get some of this, such as:
count(Divvy$from_station_id.x==94 & Divvy$to_station_id.x == 69)
x freq
1 FALSE 2454553
2 TRUE 81
But this is obviously labor intensive as there are 300 unique stations, so well over 44k poss combinations. I created a helper table thinking I could loop it.
n <- select(Divvy, from_station_id.y )
from_station_id.x
1 94
2 69
3 240
4 113
5 113
6 127
count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1])
x freq
1 FALSE 2454553
2 TRUE 81
I felt like a loop such as
output <- matrix(ncol=variables, nrow=iterations)
output <- matrix()
for(i in 1:n)(output[i, count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1]))
should work but come to think of it that will still only return 300 rows, not 44k, so it would have to then loop back and do n[2] & n[1] etc...
I felt like there might also be a quicker dplyr solution that would let me return a count of each combo and append it directly without the extra steps/table creation, but I haven't found it.
I'm newer to R and I have searched around/think I'm close, but I can't quite connect that last dot of joining that result to Divvy. Any help appreciated.
Here is the data.table solution, which is useful if you are working with large data:
library(data.table)
setDT(DF)[, sum := .N, by = .(from_station_id.x, to_station_id.x)][]   # DF is your data frame
X trip_id from_station_id.x to_station_id.x sum
1: 1 1109420 94 69 1
2: 2 1109421 69 216 1
3: 3 1109427 240 245 2
4: 4 1109431 113 94 1
5: 5 1109433 127 332 1
6: 3 1109429 240 245 2
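If you only want the distinct from/to combinations with their counts (the summary table sketched in the question), the aggregate form of the same data.table idiom is:
setDT(DF)[, .(sum = .N), by = .(from_station_id.x, to_station_id.x)]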
Since you said "limit it to distinct from_station_id/to_combos", the following code seems to provide what you are after. Your data is called mydf.
library(dplyr)
group_by(mydf, from_station_id.x, to_station_id.x) %>%
count(from_station_id.x, to_station_id.x)
# from_station_id.x to_station_id.x n
#1 69 216 1
#2 94 69 1
#3 113 94 1
#4 127 332 1
#5 240 245 2
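To also append these counts to every row of the original table (the join-back step asked about), a sketch in the same dplyr vocabulary, with mydf standing in for the full Divvy table:
library(dplyr)
mydf %>%
  group_by(from_station_id.x, to_station_id.x) %>%
  mutate(Sum = n()) %>%
  ungroup()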
I'm not entirely sure that's what you're looking for as a result, but this calculates the number of trips having the same origin and destination. Feel free to comment and let me know if that's not quite what you expect as a final result.
dat <- read.table(text="X trip_id from_station_id.x to_station_id.x
1 1109420 94 69
2 1109421 69 216
3 1109427 240 245
4 1109431 113 94
5 1109433 127 332
3 1109429 240 245", header=TRUE)
dat$from.to <- paste(dat$from_station_id.x, dat$to_station_id.x, sep="-")  # build a "from-to" key
freqs <- as.data.frame(table(dat$from.to))                                 # count each key
names(freqs) <- c("from.to", "sum")
dat2 <- merge(dat, freqs, by="from.to")                                    # join counts back
dat2 <- dat2[order(dat2$trip_id), -1]                                      # restore order, drop the key
Results
dat2
# X trip_id from_station_id.x to_station_id.x sum
# 6 1 1109420 94 69 1
# 5 2 1109421 69 216 1
# 3 3 1109427 240 245 2
# 4 3 1109429 240 245 2
# 1 4 1109431 113 94 1
# 2 5 1109433 127 332 1
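One more hedged aside: if routes should ever be counted regardless of direction (so 94-69 and 69-94 collapse into one route), normalizing the key with pmin/pmax before tabulating handles that case:
dat$route <- paste(pmin(dat$from_station_id.x, dat$to_station_id.x),
                   pmax(dat$from_station_id.x, dat$to_station_id.x), sep="-")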

R: Error while calculating Rolling Median and Rolling Mean

I am trying to calculate 3 period rolling means and rolling medians for the following data:
SiteID Month TotalSessions TotalMinutes
1 201401 132 1334
1 201402 159 2498
1 201403 98 734
1 201404 112 909
2 201402 25 220
2 201404 32 407
4 201401 10 77
4 201402 12 112
4 201403 9 59
However I am getting an error when I use the following function:
ave(mydf$TotalSessions, mydf$SiteID, FUN = function(x) rollmedian(x,k=3, align = "right", na.pad = T))
Error: k <= n is not TRUE
I understand that the error arises because some SiteIDs have fewer than 3 periods of data, and hence the rolling median cannot be calculated for them.
My question is: is there a way I can add the missing months with 0s in TotalSessions and TotalMinutes, so that the data would look as follows:
SiteID Month TotalSessions TotalMinutes
1 201401 132 1334
1 201402 159 2498
1 201403 98 734
1 201404 112 909
2 201401 0 0
2 201402 25 220
2 201403 0 0
2 201404 32 407
4 201401 10 77
4 201402 12 112
4 201403 9 59
4 201404 0 0
Thanks for the help!
Personally I would use one of the solutions proposed in the other answer or in the comments.
Here is an answer that modifies your data by adding 0 for the missing months (the desired output). I mainly use the merge function.
xx <- data.frame(Month = unique(dat$Month))   # every month observed anywhere
res <- do.call(rbind,
               by(dat, dat$SiteID,
                  function(x) merge(x, xx, all.y = TRUE)))   # pad each SiteID to all months
res[is.na(res)] <- 0
# Month SiteID TotalSessions TotalMinutes
# 1.1 201401 1 132 1334
# 1.2 201402 1 159 2498
# 1.3 201403 1 98 734
# 1.4 201404 1 112 909
# 2.1 201401 0 0 0
# 2.2 201402 2 25 220
# 2.3 201403 0 0 0
# 2.4 201404 2 32 407
# 4.1 201401 4 10 77
# 4.2 201402 4 12 112
# 4.3 201403 4 9 59
# 4.4 201404 0 0 0
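Note that res[is.na(res)] <- 0 also zeroes the padded SiteID values, as visible in the printed output above. A sketch of a tidyr alternative that keeps the IDs intact, assuming tidyr is installed:
library(tidyr)
res <- complete(dat, SiteID, Month,
                fill = list(TotalSessions = 0, TotalMinutes = 0))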
Padding with NAs would be better, but even better than that is rollapply with partial = TRUE:
ave(mydf$TotalSessions, mydf$SiteID,
    FUN = function(x) rollapply(x, 3, median, align = "right", partial = TRUE))
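Both rollmedian and rollapply come from the zoo package, so library(zoo) is needed first. With partial = TRUE, windows shorter than k are evaluated on whatever points exist instead of throwing "k <= n is not TRUE"; a minimal illustration on SiteID 1's sessions:
library(zoo)
rollapply(c(132, 159, 98, 112), 3, median, align = "right", partial = TRUE)
# [1] 132.0 145.5 132.0 112.0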

mlogit duplicate 'row.names' are not allowed

I am new to R and want to use the mlogit function. However, after putting my data into a data frame and running
x <- mlogit.data(mlogit, choice="PlacedN", shape="long", alt.var="RaceID")
I get the error: duplicate 'row.names' are not allowed
I can upload my file if needed. I've spent days trying to get this to work, so any help will be appreciated.
You may want to put "RaceID" into the alt.levels argument instead of alt.var. From the mlogit.data help file:
alt.levels
the name of the alternatives: if null, for a wide data.frame, they are guessed from the variable names and the choice variable (both should be the same), for a long data.frame, they are guessed from the alt.var argument.
Give this a try.
library(mlogit)
m <- read.csv("mlogit.csv")
mlogd <- mlogit.data(m, choice="PlacedN", shape="long", alt.levels="RaceID")
head(mlogd)
# RaceID PlacedN RSP TrA JoA aDS bDS mDS aDH bDH mDH LDH MR eMR
# 1.RaceID 20119552 TRUE 3.00 13 12 0 0 0 0 0 0 0 0 131
# 2.RaceID 20119552 FALSE 4.00 23 26 91 94 94 139 153 145 153 150 150
# 3.RaceID 20119552 FALSE 0.83 15 15 99 127 99 150 153 150 153 159 159
# 4.RaceID 20119552 FALSE 18.00 21 15 0 0 0 0 0 0 0 0 131
# 5.RaceID 20119552 FALSE 16.00 16 12 92 127 92 134 135 134 135 136 136
# 6.RaceID 20119617 TRUE 2.50 12 10 0 0 0 0 0 0 0 0 152
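One likely cause of the original error, offered as a hedged note: with shape="long", mlogit.data derives row names from the choice situation and alternative indexes, so duplicate 'row.names' are not allowed usually means the variable passed as alt.var does not uniquely label the alternatives within each choice situation. In the data above, each RaceID spans several rows (one per runner), so it identifies the choice situation rather than the alternative:
table(duplicated(m$RaceID))   # TRUE rows share a RaceID, so RaceID cannot index alternatives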
