Populate vectors with 2 for loops - r

I may be misunderstanding how for loops work, but I'm having a hard time comprehending why the current code doesn't populate the vectors (they evidently remain NA, although the code itself runs). I imagine there may also be a way to subset all of this information using ifelse(), but I'm experiencing "coder's block".
Issue (elaborated): I am trying to code a running Electoral College projection based on a betting market from the 2008 presidential cycle, over the final 90 days until Election Day. I justify using two for loops because the code needs to check conditional statements on a particular day and add a particular value to a preexisting sum on that day. In other words, if the betting price for Obama is higher than McCain's on a particular day for a particular state, that state's electoral votes are awarded to Obama on that day, and vice versa. Again, the code runs, but the vectors apparently remain NA.
Key of Relevant Variables
EV, electoral votes of that particular state
X, a unique value assigned to each observation
day, date class
PriceD, betting price for the Dem candidate
PriceR, betting price for the Rep candidate
DaysToEday, a numeric value indicating the difference between variable day and election day (2008-11-04)
Code in Question
Obama08.ECvotesByDay <- McCain08.ECvotesByDay <- rep(NA, 90)
for (i in 1:90) {
  for (j in 1:nrow(subset(mpres08, mpres08$DaysToEday <= 90))) {
    if (mpres08$PriceD[j] > mpres08$PriceR[j]) {
      Obama08.ECvotesByDay[i] <- Obama08.ECvotesByDay[i] + mpres08$EV[j]
    } else {
      McCain08.ECvotesByDay[i] <- McCain08.ECvotesByDay[i] + mpres08$EV[j]
    }
  }
}
dput of Data (five rows)
structure(list(state = c("AK", "AK", "AK", "AK", "AK"), state.name = c("Alaska",
"Alaska", "Alaska", "Alaska", "Alaska"), Obama = c(38L, 38L,
38L, 38L, 38L), McCain = c(59L, 59L, 59L, 59L, 59L), EV = c(3L,
3L, 3L, 3L, 3L), X = c(24073L, 25195L, 8773L, 25603L, 25246L),
day = structure(c(13937, 13959, 13637, 13967, 13960), class = "Date"),
PriceD = c(7.5, 7.5, 10, 8, 7.5), VolumeD = c(0L, 0L, 0L,
0L, 0L), PriceR = c(92.5, 92.5, 90, 92, 92.5), VolumeR = c(0L,
0L, 0L, 0L, 0L), DaysToEday = c(250, 228, 550, 220, 227)), row.names = c(NA,
5L), class = "data.frame")

You are adding a number to NA, and for R the result is NA.

Obama08.ECvotesByDay[i] and McCain08.ECvotesByDay[i] are initialised with NA, and in R any arithmetic with NA stays NA (e.g. NA + 1 results in NA). If 0 is a neutral starting value for you, initialise the vectors with 0 instead:
Obama08.ECvotesByDay <- McCain08.ECvotesByDay <- rep(0, 90)
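For completeness, here is a minimal sketch of the repaired loop. One hedge: the original inner loop never uses i, so every day would accumulate the same grand total; the sketch assumes (the question doesn't confirm this) that day i should only tally rows with DaysToEday == i.

# sketch: start at 0 instead of NA, and (assumption) restrict each day's tally
# to that day's rows via DaysToEday == i
Obama08.ECvotesByDay <- McCain08.ECvotesByDay <- rep(0, 90)
for (i in 1:90) {
  for (j in which(mpres08$DaysToEday == i)) {
    if (mpres08$PriceD[j] > mpres08$PriceR[j]) {
      Obama08.ECvotesByDay[i] <- Obama08.ECvotesByDay[i] + mpres08$EV[j]
    } else {
      McCain08.ECvotesByDay[i] <- McCain08.ECvotesByDay[i] + mpres08$EV[j]
    }
  }
}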

Related

Select the optimal number based on conditions

This is my minimal dataset:
df=structure(list(ID = c(3942504L, 3199413L, 1864266L, 4037617L,
2030477L, 1342330L, 5434070L, 3200378L, 4810153L, 4886225L),
MI_TIME = c(1101L, 396L, 1140L, 417L, 642L, 1226L, 1189L,
484L, 766L, 527L), MI_Status = c(0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 1L, 0L), Stroke_status = c(1L, 0L, 1L, 0L, 0L, 0L,
0L, 1L, 1L, 0L), Stroke_time = c(1101L, 396L, 1140L, 417L,
642L, 1226L, 1189L, 484L, 766L, 527L), Arrhythmia_status = c(NA,
NA, TRUE, NA, NA, TRUE, NA, NA, TRUE, NA), Arrythmia_time = c(1101L,
356L, 1122L, 7L, 644L, 126L, 118L, 84L, 76L, 5237L)), row.names = c(NA,
10L), class = "data.frame")
As you can see, I mainly have 2 types of variables, "_status" and "_time".
I am preparing my dataset for a survival analysis, and "time" is time to event in days.
But the problem arises when I try to create a variable called "any cardiovascular outcome" (df$CV), which I have defined as follows:
df$CV = NA
df$CV <- with(df, ifelse(MI_Status=='1' | Stroke_status=='1' | Arrhythmia_status== 'TRUE' ,'1', '0'))
df$CV = as.factor(df$CV)
The problem I have is with selecting the optimal time to event. As now I have a new variable called df$CV, but 3 different "_time" variables.
So I would like to create a new column, called df$CV_time, where time is the time for the event that happened first.
There is a slight difficulty in this problem though, and I put an example:
If we have a subject with MI_status==1, Arrythmia_status==NA, stroke_status==1 and MI_time==200, Arrythmia_time==100, stroke_time==220 --> the correct time for df$CV would be 200, as it is the time for the earliest event.
However, in a case where MI_status==0, Arrythmia_status==NA, stroke_status==0 and MI_time==200, Arrythmia_time==100, stroke_time==220 --> the correct time for df$CV would be 220, as the latest follow-up time is 220 days.
How could I select the optimal number for df$CV based on these conditions?
This might be one approach using tidyverse.
First, you may want to make sure your column names are consistent with spelling and case (here using rename).
Then, you can explicitly define your "Arrhythmia" outcome as TRUE or FALSE (instead of using NA).
You can put your data into long form with pivot_longer, and then group_by your ID. You can include the specific columns related to MI, stroke, and arrhythmia here (those for which both "time" and "status" columns are available). Note that in your actual dataset (the one you show with glimpse) it is unclear what you want for arrhythmia: there is a pif column name, but nothing specific for time or status.
Your cardiovascular outcome will include status for MI or Stroke that is 1, or Arrhythmia that is TRUE.
The time to event would be the min time if there was a cardiovascular outcome, otherwise use the censored time of latest follow up or max time.
Let me know if this gives you the desired output.
library(tidyverse)
df %>%
  rename(MI_time = MI_TIME, MI_status = MI_Status, Arrhythmia_time = Arrythmia_time) %>%
  replace_na(list(Arrhythmia_status = FALSE)) %>%
  pivot_longer(cols = c(starts_with("MI_"), starts_with("Stroke_"), starts_with("Arrhythmia_")),
               names_to = c("event", ".value"),
               names_sep = "_") %>%
  group_by(ID) %>%
  summarise(
    any_cv_outcome = any(status[event %in% c("MI", "Stroke")] == 1 | status[event == "Arrhythmia"]),
    cv_time_to_event = ifelse(any_cv_outcome, min(time), max(time))
  )
Output
ID any_cv_outcome cv_time_to_event
<int> <lgl> <int>
1 1342330 TRUE 126
2 1864266 TRUE 1122
3 2030477 FALSE 644
4 3199413 FALSE 396
5 3200378 TRUE 84
6 3942504 TRUE 1101
7 4037617 FALSE 417
8 4810153 TRUE 76
9 4886225 FALSE 5237
10 5434070 FALSE 1189
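One caveat, offered as a hedged refinement rather than a fix: min(time) above runs over all three times, including times attached to events that did not occur. Going by the question's first example (MI event at 200, non-event arrhythmia time 100, expected answer 200), you may want the minimum over event times only. On this sample the two versions differ only for ID 3200378 (stroke at 484, non-event arrhythmia time 84), where the refinement gives 484 instead of 84:

# possible replacement for the summarise() above (untested); after pivot_longer,
# status is 0/1 for MI/Stroke and FALSE/TRUE (coerced to 0/1) for Arrhythmia
summarise(
  any_cv_outcome = any(status[event %in% c("MI", "Stroke")] == 1 | status[event == "Arrhythmia"]),
  # if/else rather than ifelse() so min() is only evaluated when an event exists
  cv_time_to_event = if (any_cv_outcome) min(time[status == 1]) else max(time)
)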

How do I avoid using for-loops

I am currently working on listening data of a music platform in R.
I have a subset (listening.subset) of the total data set. It contains 6 columns (USER, artist, Week, COUNT, user_type, binary).
Each user can either be a focal user, a friend, or a neighbour. There are separate data sets that link focal users to their friends (friend.data) and neighbours (neighbour.data), but I added a column to indicate the type of user.
Now, I have the following for-loop to indicate whether a friend has listened to an artist in the 10 weeks before the focal user has listened to that same artist. If that is the case, the binary column must show a 0, else a 1.
listening.subset$binary <- NA
for (i in 1:count(listening.subset)$n) {
  test_user <- listening.subset[i, ]
  test_week <- test_user$Week
  test_artist <- test_user$artist
  if (test_user$user_type == "friend") {
    foc <- vlookup(test_user$USER, friend.data, result_column = 1, lookup_column = 2)
    prior_listen <- listening.subset %>%
      filter(USER == foc) %>%
      group_by(artist) %>%
      filter(test_week >= (Week - 10) & test_week <= Week) %>%
      filter(artist == test_artist)
    if (nrow(prior_listen) > 0) {
      listening.subset[i, ]$binary <- 0
    } else {
      listening.subset[i, ]$binary <- 1
    }
  }
}
The problem with this for-loop is that it takes too long to apply to the full data set. Therefore, I want to apply vectorization. However, this concept is vague to me, and after reading up on it online I still do not have a clue how I should adjust my code.
I hope someone knows how to use vectorization and could help me.
EDIT1: the total data set contains around 50 million entries. However, I could split it up in 10 data sets of 5 million each.
EDIT2: dput of listening.subset (the start of the dput output is cut off):
"clubanddeform", "HyprMusic", "Peter-182", "komosionmel", "SHHitsKaty",
"Sonik_Villa", "Haalf"), artist = c("Justin Timberlake", "Ediya",
"Lady Gaga", "El Guincho", "Lighthouse Family", "Pidżama Porno",
"The Men", "Modest Mouse", "Com Truise", "April Smith and The Great Picture Show"
), Week = c(197L, 213L, 411L, 427L, 443L, 232L, 431L, 312L, 487L,
416L), COUNT = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 6L, 11L), user_type = c("friend",
"friend", "friend", "friend", "neighbour", "friend", "neighbour",
"friend", "focal", "friend"), binary = c(1, 1, 1, 1, NA, 1, NA,
1, NA, 1)), row.names = c(NA, 10L), class = "data.frame")
Where Week is an indicator for which week the user listened to the particular band (ranging between 1 and 527), and COUNT equals the amount of times the user has listened to that artist in that particular week.
Recap: The binary variable should indicate whether the "friend user" has listened to the same band as the "focal user", in the 10 weeks before the focal user played the band. The social connections can be found in the friend.data, which is depicted below.
structure(list(USER = c("TheMariner", "TheMariner", "TheMariner",
"TheMariner", "TheMariner", "TheMariner", "TheMariner", "TheMariner",
"TheMariner", "TheMariner"), FRIEND = c("npetrovay", "marno25",
"lennonstarr116", "sachmonkey", "andrewripp", "daledrops", "Skittlebite",
"Ego_Trippin", "pistolgal5", "jjollett")), row.names = c(NA,
10L), class = "data.frame")
For each of the 190 focal users (first column), the friends are listed next to it in the second column.
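A hedged sketch of one way to vectorise this, untested against the full data and assuming friend.data really has the two columns shown above (USER = focal user, FRIEND = friend): replace the per-row vlookup and filter with joins, so the flag is computed in a handful of set operations instead of 50 million iterations.

library(dplyr)

# map each friend to their focal user (column names per the dput above)
links <- friend.data %>% rename(focal = USER, USER = FRIEND)

# the focal users' own listens, keyed by (focal, artist)
focal_listens <- listening.subset %>%
  filter(user_type == "focal") %>%
  transmute(focal = USER, artist, focal_week = Week)

# friend rows whose artist the focal user also played within the 10 weeks
# after the friend's play, i.e. the friend listened first
hits <- listening.subset %>%
  filter(user_type == "friend") %>%
  inner_join(links, by = "USER") %>%
  inner_join(focal_listens, by = c("focal", "artist")) %>%
  filter(focal_week >= Week, focal_week <= Week + 10) %>%
  distinct(USER, artist, Week) %>%
  mutate(hit = TRUE)

listening.subset <- listening.subset %>%
  left_join(hits, by = c("USER", "artist", "Week")) %>%
  mutate(binary = ifelse(user_type == "friend",
                         ifelse(coalesce(hit, FALSE), 0, 1),
                         binary)) %>%
  select(-hit)

At 50 million rows the same joins in data.table would likely be faster still, but the structure is identical.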

Removing duplicates of same date and location (3 columns) in R

I know there are like a million questions regarding duplicate removal, but unfortunately
none of them helped me so far. I struggle with the following:
I have a data frame (loc) that includes data of citizen science observations of nature (animals, plants, etc.). It has about 90,000 rows and looks like this:
ID Datum lat long Anzahl Art Gruppe Anrede Wochentag
1 1665376475 2019-05-09 51.30993 9.319896 20 Alytes obstetricans Amphibien Herr Do
2 529728479 2019-05-06 50.58524 8.503332 1 Alytes obstetricans Amphibien Frau Mo
3 1579862637 2019-05-23 50.53925 8.467546 8 Alytes obstetricans Amphibien Herr Do
4 -415013306 2019-05-06 50.58524 8.503332 3 Alytes obstetricans Amphibien Frau Mo
I also made a small sample data frame (loc_sample) of 10 observations and used dput(loc_sample):
structure(list(ID = c(688380991L, -1207894879L, 802295973L, -815104336L, -632066829L, -133354744L, 1929856503L, 952982037L, 1782222413L, 1967897802L),
Datum = structure(c(1559088000, 1558742400, 1557619200, 1557273600, 1557187200, 1557619200, 1557619200, 1557187200, 1557964800, 1556841600),
tzone = "UTC",
class = c("POSIXct", "POSIXt")),
lat = c(52.1236088700115, 51.5928822313012, 53.723426877949, 50.7737623304861, 49.9238597947287, 51.805563222817, 50.1738326622472, 51.2763067511127, 51.395189306337, 51.5732959108075),
long = c(8.62399927116144, 9.89597797393799, 9.04058595819038, 8.20740532922287, 8.29073164862348, 9.9225640296936, 8.79065646492143, 6.40700340270996, 6.47360801696777, 6.25690012620748),
Anzahl = c(2L, 25L, 4L, 1L, 1L, 30L, 2L, 1L, 1L, 1L),
Art = c("Sturnus vulgaris", "Olethreutes arcuella", "Sylvia atricapilla", "Buteo buteo", "Turdus merula", "Orchis mascula subsp. mascula", "Parus major", "Luscinia megarhynchos", "Milvus migrans", "Andrena bicolor"),
Gruppe = c("Voegel", "Schmetterlinge", "Voegel", "Voegel", "Voegel", "Pflanzen", "Voegel", "Voegel", "Voegel", "InsektenSonstige"),
Anrede = c("Herr", "Herr", "Frau", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr"),
Wochentag = structure(c(4L, 7L, 1L, 4L, 3L, 1L, 1L, 3L, 5L, 6L),
.Label = c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa"),
class = c("ordered", "factor"))),
row.names = c(NA, -10L),
class = "data.frame")
For my question only the variables Datum, lat and long are important. Datum is a date in POSIXct format, while lat and long are both numeric. There are quite a few observations that were reported on the same day from the exact same location. I would like to filter and remove those. So I have to check three separate columns and keep only one of each "same-place-same-day" observations.
I already tried putting the three variables in question into one:
loc$dupl <- paste(loc$Datum, loc$lat, loc$long, sep=" ,")
locu <- unique(loc[,2:4])
It seems like I managed to filter the duplicates, but I'm actually not sure if that's how it is done correctly.
Also, that gives me a data frame with only Datum, lat and long. As a final result I need the original data frame without the duplicates in date and location, but with all the other information for the unique rows still left.
When I try:
locu <- unique(loc[,2:9])
It gives me all the other columns, but it doesn't remove the date and location duplicates.
Thanks in advance for your help!
This can work:
#Code
new <- loc[!duplicated(paste(loc$Datum,loc$lat,loc$long)),]
To get the full data frame back after finding the duplicates, you could do something like:
loc[!duplicated(loc[,2:4]),]
This code first detects the duplicate rows and then subsets your original data frame.
Note: this code will always keep the first occurrence and delete the duplicates in subsequent rows. If you want to keep a certain ID (e.g. the second one, not the first one), we need a different solution.
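If you are already using the tidyverse, a hedged equivalent (same keep-the-first-occurrence behaviour) is dplyr::distinct with .keep_all = TRUE:

library(dplyr)
# drops rows whose Datum/lat/long combination was already seen, keeping all other columns
locu <- distinct(loc, Datum, lat, long, .keep_all = TRUE)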

How to efficiently select a random sample of variables from a set of variables in a dataframe

I would appreciate any help to randomly select a subset containing 5 out of 10 var.w_X variables from my sample data sampleDT, while keeping all the other variables that do not start with var.w_.
Below is the sample data sampleDT which contains, among other variables (those to be kept altogether), X variables starting with var.w_ in their names (those from which to draw the random sample).
In the current example, X=10, so that var.w_ includes var.w_1 to var.w_10, and I want to draw a random sample of 5 out of these 10. However, in my actual data, X > 1,000,000, and I might want to draw a sample of 7,500 var.w_ variables out of them.
Therefore, accounting for efficiency is paramount in any given solution, since I recently experienced some performance issues with mutate_at whose cause I still can't explain.
Importantly, the other variables to keep (those that do not start with var.w_) are not guaranteed to be in any pre-specified order: they might be located before, between, and/or after the var.w_ variables. So solutions that rely on column order will not work.
#sample data
sampleDT<-structure(list(n = c(62L, 96L, 17L, 41L, 212L, 143L, 143L, 143L,
73L, 73L), r = c(3L, 1L, 0L, 2L, 170L, 21L, 0L, 33L, 62L, 17L
), p = c(0.0483870967741935, 0.0104166666666667, 0, 0.0487804878048781,
0.80188679245283, 0.146853146853147, 0, 0.230769230769231, 0.849315068493151,
0.232876712328767), var.w_8 = c(1.94254385942857, 1.18801169942857,
3.16131123942857, 3.16131123942857, 1.13482609242857, 1.13042157942857,
2.13042157942857, 1.13042157942857, 1.12335579942857, 1.12335579942857
), var.w_9 = c(1.942365288, 1.187833128, 3.161132668, 3.161132668,
1.134647521, 1.130243008, 2.130243008, 1.130243008, 1.123177228,
1.123177228), var.w_10 = c(1.94222639911111, 1.18769423911111,
3.16099377911111, 3.16099377911111, 1.13450863211111, 1.13010411911111,
2.13010411911111, 1.13010411911111, 1.12303833911111, 1.12303833911111
), group = c(1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L,
0L, 0L), treat = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L), c1 = c(1.941115288,
1.186583128, 1.159882668, 1.159882668, 1.133397521, 1.128993008,
1.128993008, 1.128993008, 1.121927228, 1.121927228), var.w_6 = c(1.939115288, 1.184583128,
3.157882668, 3.157882668, 1.131397521, 1.126993008, 2.126993008,
1.126993008, 1.119927228, 1.119927228), var.w_7 = c(1.94278195466667,
1.18824979466667, 3.16154933466667, 3.16154933466667, 1.13506418766667,
1.13065967466667, 2.13065967466667, 1.13065967466667, 1.12359389466667,
1.12359389466667), c2 = c(0.1438,
0.237, 0.2774, 0.2774, 0.2093, 0.1206, 0.1707, 0.0699, 0.1351,
0.1206), var.w_1 = c(1.941115288, 1.186583128, 3.159882668, 3.159882668,
1.133397521, 1.128993008, 2.128993008, 1.128993008, 1.121927228,
1.121927228), var.w_2 = c(1.931115288, 1.176583128, 3.149882668,
3.149882668, 1.123397521, 1.118993008, 2.118993008, 1.118993008,
1.111927228, 1.111927228), var.w_3 = c(1.946115288, 1.191583128,
3.164882668, 3.164882668, 1.138397521, 1.133993008, 2.133993008,
1.133993008, 1.126927228, 1.126927228), var.w_4 = c(1.93778195466667,
1.18324979466667, 3.15654933466667, 3.15654933466667, 1.13006418766667,
1.12565967466667, 2.12565967466667, 1.12565967466667, 1.11859389466667,
1.11859389466667), var.w_5 = c(1.943615288, 1.189083128, 3.162382668,
3.162382668, 1.135897521, 1.131493008, 2.131493008, 1.131493008,
1.124427228, 1.124427228)), class = "data.frame", row.names = c(NA, -10L))
#my attempt
# based on the comment by @akrun; this does not keep the other variables as specified above
myvars <- sample(grep("var\\.w_", names(sampleDT), value = TRUE), 5)
sampleDT_test <- sampleDT[myvars]
Thanks in advance for any help
Apologies, I had to step into a meeting for a little bit. So, I think you could adapt akrun's solution and keep the non-matching columns alongside the sampled ones. Let me know how this scales on the full dataframe. Also, thanks for clarifying further.
> # Subsetting the variable names not matching your pattern using grepl
> names(sampleDT)[!grepl("var\\.w_", names(sampleDT))]
[1] "n" "r" "p" "group" "treat" "c1" "c2"
>
> # Combine that with akrun's solution
> myvars <- c(names(sampleDT)[!grepl("var\\.w_", names(sampleDT))],
+ sample(grep("var\\.w_", names(sampleDT), value = TRUE), 5))
> head(sampleDT[myvars])
n r p group treat c1 c2 var.w_6 var.w_1 var.w_4 var.w_3 var.w_8
1 62 3 0.04838710 1 0 1.941115 0.1438 1.939115 1.941115 1.937782 1.946115 1.942544
2 96 1 0.01041667 1 0 1.186583 0.2370 1.184583 1.186583 1.183250 1.191583 1.188012
3 17 0 0.00000000 0 0 1.159883 0.2774 3.157883 3.159883 3.156549 3.164883 3.161311
4 41 2 0.04878049 1 0 1.159883 0.2774 3.157883 3.159883 3.156549 3.164883 3.161311
5 212 170 0.80188679 0 0 1.133398 0.2093 1.131398 1.133398 1.130064 1.138398 1.134826
6 143 21 0.14685315 1 1 1.128993 0.1206 1.126993 1.128993 1.125660 1.133993 1.130422
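If name-based subsetting ever becomes a bottleneck at X > 1,000,000 columns, the same idea can be expressed with integer indices instead of a character vector of names; whether that is measurably faster here is an assumption I have not benchmarked:

# same selection via column indices (untested at the million-column scale)
keep_idx   <- which(!grepl("var\\.w_", names(sampleDT)))  # columns to keep as-is
w_idx      <- which(grepl("var\\.w_", names(sampleDT)))   # candidate var.w_ columns
sample_idx <- sort(sample(w_idx, 5))                      # 7500 on the real data
sampleDT_sub <- sampleDT[, c(keep_idx, sample_idx)]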

How can you loop this higher-order function in R?

This question relates to the reply I received here with a nice little function from thelatemail.
The dataframe I'm using is not optimal, but it's what I've got and I'm simply trying to loop this function across all rows.
This is my df
dput(SO_Example_v1)
structure(list(Type = structure(c(3L, 1L, 2L), .Label = c("Community",
"Contaminant", "Healthcare"), class = "factor"), hosp1_WoundAssocType = c(464L,
285L, 24L), hosp1_BloodAssocType = c(73L, 40L, 26L), hosp1_UrineAssocType = c(75L,
37L, 18L), hosp1_RespAssocType = c(137L, 77L, 2L), hosp1_CathAssocType = c(80L,
34L, 24L), hosp2_WoundAssocType = c(171L, 115L, 17L), hosp2_BloodAssocType = c(127L,
62L, 12L), hosp2_UrineAssocType = c(50L, 29L, 6L), hosp2_RespAssocType = c(135L,
142L, 6L), hosp2_CathAssocType = c(95L, 24L, 12L)), .Names = c("Type",
"hosp1_WoundAssocType", "hosp1_BloodAssocType", "hosp1_UrineAssocType",
"hosp1_RespAssocType", "hosp1_CathAssocType", "hosp2_WoundAssocType",
"hosp2_BloodAssocType", "hosp2_UrineAssocType", "hosp2_RespAssocType",
"hosp2_CathAssocType"), class = "data.frame", row.names = c(NA,
-3L))
####################
#what it looks like#
####################
require(dplyr)
df <- tbl_df(SO_Example_v1)
head(df)
Type hosp1_WoundAssocType hosp1_BloodAssocType hosp1_UrineAssocType
1 Healthcare 464 73 75
2 Community 285 40 37
3 Contaminant 24 26 18
Variables not shown: hosp1_RespAssocType (int), hosp1_CathAssocType (int), hosp2_WoundAssocType
(int), hosp2_BloodAssocType (int), hosp2_UrineAssocType (int), hosp2_RespAssocType (int),
hosp2_CathAssocType (int)
The function I have performs a chisq.test across all categories in df$Type. Ideally the function should switch to a fisher.test() if a cell count is < 5, but that's a separate issue (extra brownie points for the person who comes up with how to do that, though).
This is the function I'm using to go row by row
func <- Map(
  function(x, y) {
    out <- cbind(x, y)
    final <- rbind(out[1, ], colSums(out[2:3, ]))
    chisq <- chisq.test(final, correct = FALSE)
    chisq$p.value
  },
  SO_Example_v1[grepl("^hosp1", names(SO_Example_v1))],
  SO_Example_v1[grepl("^hosp2", names(SO_Example_v1))]
)
func
But ideally, I'd want it to be something like this
for (i in 1:nrow(df)) { func }
But that doesn't work. A further wrinkle is that when, for example, row two is taken, the final call looks like this
func <- Map(
  function(x, y) {
    out <- cbind(x, y)
    final <- rbind(out[2, ], colSums(out[c(1, 3), ]))
    chisq <- chisq.test(final, correct = FALSE)
    chisq$p.value
  },
  SO_Example_v1[grepl("^hosp1", names(SO_Example_v1))],
  SO_Example_v1[grepl("^hosp2", names(SO_Example_v1))]
)
func
so the function should understand that the row it takes for out[x, ] has to be excluded from colSums(). This data.frame only has 3 rows, so it's easy, but I've tried applying this function to a separate data.frame I have with >200 rows, so it would be nice to be able to loop this somehow.
Any help appreciated.
Cheers
You were missing two things:
To select row i and all rows but i, you want to use
u[i] and u[-i]
If an item is not the same length as the others given to Map, it is recycled, a very general property of the language. You then just have to add an argument to the function corresponding to the row you want to oppose to the others; it will be recycled across all the items of the vectors passed.
The following does what you asked for
# the function doing the stats
FisherOrChisq <- function(x, y, lineComp) {
  out <- cbind(x, y)
  final <- rbind(out[lineComp, ], colSums(out[-lineComp, ]))
  test <- chisq.test(final, correct = FALSE)
  return(test$p.value)
}
# test of the stat function
FisherOrChisq(SO_Example_v1[grep("^hosp1", names(SO_Example_v1))[1]],
              SO_Example_v1[grep("^hosp2", names(SO_Example_v1))[1]], 2)
# making the loop
result <- c()
for (type in SO_Example_v1$Type) {
  line <- which(SO_Example_v1$Type == type)
  res <- Map(FisherOrChisq,
             SO_Example_v1[grepl("^hosp1", names(SO_Example_v1))],
             SO_Example_v1[grepl("^hosp2", names(SO_Example_v1))],
             line)
  result <- rbind(result, res)
}
colnames(result) <- gsub("^hosp[0-9]+", "", colnames(result))
rownames(result) <- SO_Example_v1$Type
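And for the brownie points mentioned in the question, a hedged sketch (my suggestion, not part of the original answer) of the chisq/Fisher switch: chisq.test() exposes the expected counts, so the function above can fall back to fisher.test() whenever any expected cell drops below 5.

FisherOrChisq <- function(x, y, lineComp) {
  out <- cbind(x, y)
  final <- rbind(out[lineComp, ], colSums(out[-lineComp, ]))
  # suppress the "approximation may be incorrect" warning; we check expected counts ourselves
  chisq <- suppressWarnings(chisq.test(final, correct = FALSE))
  if (any(chisq$expected < 5)) {
    return(fisher.test(as.matrix(final))$p.value)
  }
  chisq$p.value
}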
That said, what you are doing is very heavy multiple testing. I would be extremely cautious with the corresponding p-values; at a minimum, apply a multiple-testing correction such as Bonferroni or Benjamini-Hochberg (see p.adjust in R).
