Reproduce a dataset in a different format in R

I have a dataset Data like below:
dput(Data)
structure(list(FN = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "20131202-0985 ", class = "factor"), Values = structure(c(1L,
8L, 7L, 6L, 5L, 9L, 2L, 4L, 3L), .Label = c("|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567",
"|8121|B01|SOMERSET STN", "|96942883", "|SN30|SMRT\n", "CENTRAL",
"FOUR SEASONS HOTEL", "HOTEL", "IKEA", "nanyang avenue"), class = "factor"),
IND = structure(c(4L, 1L, 1L, 1L, 1L, 6L, 3L, 2L, 5L), .Label = c("BN",
"BR", "BS", "LOC", "PN", "RN"), class = "factor")), .Names = c("FN",
"Values", "IND"), class = "data.frame", row.names = c(NA, -9L
))
I want the above dataset converted into a data frame (out_data) in the format below.
At present Data has 3 columns, and I need to convert these into 16 columns.
I need to reshape my input into exactly the layout given in the screenshot, as a data frame.
I cannot change the structure below:
colnames(out_data) <- c("FN","H_BLK","S_N/R_N","B_N","FL_N","U_N","PC","XC","YC","BS","BRF","LCT_DEC","BRN","BO","PN","S_TY_CD")
The multi-value entries in the input are always in the formats below:
|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567 -> |PC|H_BLK|S_N/R_N|XC|YC
|8121|B01|SOMERSET STN -> |BS|BRF|LCT_DEC
|SN30|SMRT -> |BRN|BO
The rules are:
IND = LOC - then PC, H_BLK, S_N/R_N, XC and YC should be updated, with S_TY_CD = LOC
IND = BN - then the B_N column should be updated, with S_TY_CD = BN
IND = RN - then the S_N/R_N column should be updated, with S_TY_CD = RN
IND = BS - then BS, BRF and LCT_DEC should be updated, with S_TY_CD = BS
IND = BR - then BRN and BO should be updated, with S_TY_CD = BR
IND = PN - then the PN column should be updated, with S_TY_CD = PN
Is there an efficient way of doing this?

Here's one method of transformation. First I define some helper functions for the various sub problems.
#define out cols
outcols<-c("FN", "H_BLK", "S_N/R_N", "B_N", "FL_N", "U_N", "PC",
"XC", "YC", "BS", "BRF", "LCT_DEC", "BRN","BO","PN","S_TY_CD")
#identify parts for each compound value
namevals <- function(ind, vals) {
names<-if (ind=="LOC") {
c("PC","H_BLK","S_N/R_N","XC","YC")
} else if (ind=="BN") {
c("B_N")
} else if (ind=="RN") {
c("S_N/R_N")
} else if (ind=="BS") {
c("BS","BRF","LCT_DEC")
} else if (ind=="BR") {
c("BRN","BO")
} else if (ind=="PN") {
c("PN")
}
stopifnot(length(names)==length(vals))
stopifnot(all(names %in% outcols))
names(vals)<-names
vals
}
#add missing values for row
fillrow <- function(nvals) {
r<-rep(NA, length(outcols))
r[match(names(nvals), outcols)]<-nvals
r
}
Now I apply these to each row of the data with mapply to return a character vector. Here we make sure to split the "values" column on the pipe and remove the leading pipe.
#combine rows into character matrix
dt<-mapply(function(fn,vals,ind){
x<-c(FN=fn,namevals(ind, vals), "S_TY_CD"=ind)
fillrow(x)
},
as.character(Data$FN),
strsplit(gsub("^\\|","",as.character(Data$Values)),"|", fixed=T),
as.character(Data$IND)
)
Finally we tidy the data up so it can be written out to a file with write.table. Note that all missing values are true R NA values. In the write.table, you can set na = "" if you'd rather they print out as blank values than the default "NA" value.
#turn matrix into data.frame with proper names
dd<-data.frame(unname(t(dt)), stringsAsFactors=F)
names(dd)<-outcols
dd
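To make the split-and-fill step easier to follow, here is a minimal sketch for a single row. It reuses one compound value from the question, but the shortened column set `cols` and the variable names are assumptions made for illustration, not part of the answer above.

```r
# One compound value from the question's sample data
val <- "|8121|B01|SOMERSET STN"

# Drop the leading pipe, then split on the literal "|"
# (fixed = TRUE stops "|" being read as regex alternation)
parts <- strsplit(sub("^\\|", "", val), "|", fixed = TRUE)[[1]]
names(parts) <- c("BS", "BRF", "LCT_DEC")

# Hypothetical, shortened set of output columns for illustration
cols <- c("FN", "BS", "BRF", "LCT_DEC", "S_TY_CD")
row <- setNames(rep(NA_character_, length(cols)), cols)
row[names(parts)] <- parts   # fill only the columns this value provides

# Write a one-row data frame, printing missing values as blanks
out <- as.data.frame(as.list(row), stringsAsFactors = FALSE)
csv_lines <- capture.output(
  write.table(out, sep = ",", na = "", row.names = FALSE)
)
```

Columns that the value does not supply stay as true NA in `row` and only become blank at the point of writing.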

Related

Conditional updating coordinate column in dataframe

I am attempting to populate two new, empty columns in a data frame with data from other columns in the same data frame, in different ways depending on whether those columns are populated.
I am trying to populate HIGH_PRCN_LAT and HIGH_PRCN_LON (previously called F_Lat and F_Lon), which represent the final latitudes and longitudes for each row. Their values are based on the other columns in the table.
Case 1: Lat/Lon2 are populated (like in IDs 1 & 2), using the great
circle algorithm a midpoint between them should be calculated and
then placed into F_Lat & F_Lon.
Case 2: Lat/Lon2 are empty, then the values of Lat/Lon1 should be put
into F_Lat and F_Lon (like with IDs 3 & 4).
My code is as follows but doesn't work (see previous versions, removed in an edit).
The preparatory code I am using is as follows:
incidents <- structure(list(id = 1:9, StartDate = structure(c(1L, 3L, 2L,
2L, 2L, 3L, 1L, 3L, 1L), .Label = c("02/02/2000 00:34", "02/09/2000 22:13",
"20/01/2000 14:11"), class = "factor"), EndDate = structure(1:9, .Label = c("02/04/2006 20:46",
"02/04/2006 22:38", "02/04/2006 23:21", "02/04/2006 23:59", "03/04/2006 20:12",
"03/04/2006 23:56", "04/04/2006 00:31", "07/04/2006 06:19", "07/04/2006 07:45"
), class = "factor"), Yr.Period = structure(c(1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("2000 / 1", "2000 / 2", "2000 /3"
), class = "factor"), Description = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "ENGLISH TEXT", class = "factor"),
Location = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L
), .Label = c("Location 1", "Location 1 : Location 2"), class = "factor"),
Location.1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "Location 1", class = "factor"), Postcode.1 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Postcode 1", class = "factor"),
Location.2 = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
1L), .Label = c("", "Location 2"), class = "factor"), Postcode.2 = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("", "Postcode 2"
), class = "factor"), Section = structure(c(2L, 2L, 3L, 1L,
4L, 4L, 2L, 1L, 4L), .Label = c("East", "North", "South",
"West"), class = "factor"), Weather.Category = structure(c(1L,
2L, 4L, 2L, 2L, 2L, 4L, 1L, 3L), .Label = c("Animals", "Food",
"Humans", "Weather"), class = "factor"), Minutes = c(13L,
55L, 5L, 5L, 5L, 522L, 1L, 11L, 22L), Cost = c(150L, 150L,
150L, 20L, 23L, 32L, 21L, 11L, 23L), Location.1.Lat = c(53.0506727,
53.8721035, 51.0233529, 53.8721035, 53.6988355, 53.4768766,
52.6874562, 51.6638245, 51.4301359), Location.1.Lon = c(-2.9991256,
-2.4004125, -3.0988341, -2.4004125, -1.3031529, -2.2298073,
-1.8023421, -0.3964916, 0.0213837), Location.2.Lat = c(52.7116187,
53.746791, NA, 53.746791, 53.6787167, 53.4527824, 52.5264907,
NA, NA), Location.2.Lon = c(-2.7493169, -2.4777984, NA, -2.4777984,
-1.489026, -2.1247029, -1.4645023, NA, NA)), class = "data.frame", row.names = c(NA, -9L))
#gpsColumns is used as the following line of code is used for several data frames.
gpsColumns <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")
incidents [ , gpsColumns] <- NA
#create separate variable(?) containing a list of which rows are complete
ind <- complete.cases(incidents [,17])
#populate rows with a two Lat/Lons with great circle middle of both values
incidents [ind, c("HIGH_PRCN_LON_2","HIGH_PRCN_LAT_2")] <-
with(incidents [ind,,drop=FALSE],
do.call(rbind, geosphere::midPoint(cbind.data.frame(Location.1.Lon, Location.1.Lat), cbind.data.frame(Location.2.Lon, Location.2.Lat))))
#populate rows with one Lat/Lon with those values
incidents[!ind, c("HIGH_PRCN_LAT","HIGH_PRCN_LON")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]
I will use the geosphere::midPoint function based off a recommendation here: http://r.789695.n4.nabble.com/Midpoint-between-coordinates-td2299999.html.
Unfortunately, it doesn't appear that this way of populating the column will work when there are several cases.
The current error that is thrown is:
Error in `$<-.data.frame`(`*tmp*`, F_Lat, value = integer(0)) :
replacement has 0 rows, data has 178012
Edit: also posted to reddit: https://www.reddit.com/r/Rlanguage/comments/bdvavx/conditional_updating_column_in_dataframe/
Edit: Added clarity on the parts of the code I do not understand.
#replaces the F_Lat2/F_Lon2 columns in rows with a both sets of input coordinates
dataframe[ind, c("F_Lat2","F_Lon2")] <-
#I am unclear on what this means, specifically what the "with" function does and what "drop=FALSE" does and also why they were used in this case.
with(dataframe[ind,,drop=FALSE],
#I am unclear on what do.call and rbind are doing here, but the second half (geosphere onwards) is binding the Lats and Lons to make coordinates as inputs for the gcIntermediate function.
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))

Though your code doesn't work as-written for me, and I cannot calculate the same precise values you expect, I suspect the error you're seeing can be fixed with these steps. (Data is down at the bottom here.)
Pre-populate the empty columns.
Pre-calculate the complete.cases step, it'll save time.
Use cbind.data.frame for inside gcIntermediate.
I'm inferring from gcIntermediate([dataframe... (the "[" right after the opening parenthesis is a syntax error in R) that you are binding those columns together, so I'll use cbind.data.frame. (Using cbind itself produced some ignorable warnings from geosphere, so you can use it instead, perhaps with suppressWarnings, but that function is a little strong in that it'll mask other warnings as well.)
Also, since it appears you want one intermediate value for each pair of coordinates, I added the gcIntermediate(..., n=1) argument.
The use of do.call(rbind, ...) is because gcIntermediate returns a list, so we need to bring them together.
dataframe$F_Lon2 <- dataframe$F_Lat2 <- NA_real_
ind <- complete.cases(dataframe[,4])
dataframe[ind, c("F_Lat2","F_Lon2")] <-
with(dataframe[ind,,drop=FALSE],
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))
dataframe[!ind, c("F_Lat2","F_Lon2")] <- dataframe[!ind, c("Lat1","Lon1")]
dataframe
# ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon F_Lat2 F_Lon2
# 1 1 19.05067 -3.999126 92.71332 -6.759169 55.88200 -5.379147 55.78466 -6.709509
# 2 2 58.87210 -1.400413 54.74679 -4.479840 56.80945 -2.940126 56.81230 -2.942029
# 3 3 33.02335 -5.098834 NA NA 33.02335 -5.098834 33.02335 -5.098834
# 4 4 54.87210 -4.400412 NA NA 54.87210 -4.400412 54.87210 -4.400412
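Since the question asks what with(), drop=FALSE and do.call(rbind, ...) are doing, here is a small base-R sketch of each piece on its own. The toy data frame `df` and the list `pieces` are made up for illustration; `pieces` only stands in for the kind of list of one-row matrices that gcIntermediate returns.

```r
# Toy data: the second row is missing a value in column a
df <- data.frame(a = c(1, NA, 3), b = c(10, 20, 30))
ind <- complete.cases(df$a)          # TRUE, FALSE, TRUE

# drop = FALSE keeps the result a data frame, even for a single column
one_col <- df[ind, "a", drop = FALSE]
as_vec  <- df[ind, "a"]              # without it, a plain vector comes back

# with() evaluates the expression using the columns of the data frame,
# so a and b can be written without the df$ prefix
sub <- df[ind, , drop = FALSE]
total <- with(sub, a + b)

# do.call(rbind, lst) is the same as rbind(lst[[1]], lst[[2]], ...)
pieces <- list(matrix(1:2, nrow = 1), matrix(3:4, nrow = 1))
stacked <- do.call(rbind, pieces)
```

The stacked matrix has one row per list element, which is why it can be assigned straight into the matching rows of two columns.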
Update, using your new incidents data and switching to geosphere::midPoint.
Try this:
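incidents$F_Lon2 <- incidents$F_Lat2 <- NA_real_
# rows where the second location has both coordinates
ind <- complete.cases(incidents[, c("Location.2.Lat","Location.2.Lon")])
# geosphere expects longitude first, and midPoint returns columns (lon, lat)
incidents[ind, c("F_Lon2","F_Lat2")] <-
with(incidents[ind,,drop=FALSE],
geosphere::midPoint(cbind.data.frame(Location.1.Lon,Location.1.Lat),
cbind.data.frame(Location.2.Lon,Location.2.Lat)))
incidents[!ind, c("F_Lat2","F_Lon2")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]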
One (big) difference is that geosphere::gcIntermediate(..., n=1) returns a list of results, whereas geosphere::midPoint(...) (no n=) returns just a matrix, so no rbinding required.
Data:
dataframe <- read.table(header=T, stringsAsFactors=F, text="
ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon
1 19.0506727 -3.9991256 92.713318 -6.759169 55.88199535 -5.3791473
2 58.8721035 -1.4004125 54.746791 -4.47984 56.80944725 -2.94012625
3 33.0233529 -5.0988341 NA NA 33.0233529 -5.0988341
4 54.8721035 -4.4004125 NA NA 54.8721035 -4.4004125")

Subset list based on a condition in R

For a sample dataframe:
df <- structure(list(area = structure(c(1L, 4L, 3L, 8L, 5L, 7L, 6L,
2L), .Label = c("DE1", "DE3", "DE4", "DE5", "DE9", "DEA", "DEB",
"DEC"), class = "factor"), to.delete = c(1L, 0L, 1L, 0L, 1L,
1L, 1L, 0L)), .Names = c("area", "to.delete"), class = "data.frame", row.names = c(NA,
-8L))
I want to create a list of the areas which have a '1' in the 'to.delete' column. I know how to subset the 1s out of this dataframe; however, I want the list of areas because I will eventually use it to extract these areas from the main master data file (df2, listed below).
df2 <- structure(list(id = 1:24, area = structure(c(1L, 1L, 4L, 4L,
4L, 3L, 3L, 3L, 3L, 3L, 8L, 8L, 8L, 8L, 5L, 7L, 7L, 7L, 6L, 6L,
2L, 2L, 2L, 2L), .Label = c("DE1", "DE3", "DE4", "DE5", "DE9",
"DEA", "DEB", "DEC"), class = "factor")), .Names = c("id", "area"
), class = "data.frame", row.names = c(NA, -24L))
I prefer to do this in two steps, so I can easily see which areas I have deleted (thanks to answers below for suggestions of using list).
a <- list(df$area[df$to.delete == 1])
df2.subset <- df2[df2$area %in% a,]
This however doesn't seem to work at the moment, so if anyone has any ideas, then that would be great.
df2 should then be left with only areas DE5, DEC and DE3.
Many thanks.

Here is another method using split to collect the areas into two lists:
# get two lists of areas and give list items appropriate names
# (split orders the groups by value: "0" = keep comes first, then "1" = drop)
keepDrop <- setNames(split(df$area, df$to.delete), c("keep", "drop"))
# now perform dropping
df2.smaller <- df2[df2$area %in% keepDrop[["keep"]],]
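It is worth checking which label lands on which group, because split names its groups after the values of the grouping variable, in sorted order. A minimal sketch, using made-up vectors `area` and `flag` rather than the question's df:

```r
# Hypothetical stand-ins for df$area and df$to.delete
area <- c("DE1", "DE5", "DE4")
flag <- c(1, 0, 1)

raw <- split(area, flag)
names(raw)                    # groups come back as "0" then "1"

# Relabel in that same order: 0 means keep, 1 means drop
groups <- setNames(raw, c("keep", "drop"))
kept <- area[area %in% groups[["keep"]]]
```

Here only "DE5" has a 0 flag, so it is the only area left in `kept`.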

We can use subset. Based on the description, the OP wants to subset the rows of a main data ('maindata') based on the 'area' that corresponds to 1 in 'to.delete' column. In that case, we extract the 'area' (df$area[df$to.delete ==1]) and with %in% we subset the 'maindata'.
subset(maindata, area %in% df$area[df$to.delete==1])

It's not too clear what you are asking.
This will create a list where each element is a different Area:
lapply(df$area[df$to.delete == 1], function(x) x)
If you want a list with just one element containing all the areas:
list(df$area[df$to.delete == 1])
Edit:
To answer the second part of your question:
a <- list(df$area[df$to.delete == 1])
df2.subset <- df2[!df2$area %in% a[[1]], ]
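The reason the a[[1]] matters can be seen on a small example. This is only a sketch with made-up character vectors (`areas`, `x`), not the question's factors, but the behaviour of %in% against a list is the same in spirit: the list element is compared as a whole, not item by item.

```r
areas <- c("DE1", "DE4")          # hypothetical areas flagged for deletion
wrapped <- list(areas)            # a list holding one character vector
x <- c("DE1", "DE5", "DE4")       # hypothetical area column

in_list   <- x %in% wrapped       # compares against the list element as a whole
in_vector <- x %in% wrapped[[1]]  # compares against the individual codes

remaining <- x[!x %in% wrapped[[1]]]
```

So either unwrap the list with [[1]] (or unlist) before using %in%, or keep the areas as a plain vector in the first place.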

Here's what you can try:
a <- as.list(subset(df,df$to.delete == 1))
> a
$area
[1] DE1 DE4 DE9 DEB DEA
Levels: DE1 DE3 DE4 DE5 DE9 DEA DEB DEC
$to.delete
[1] 1 1 1 1 1

change the names for certain columns in a data frame [duplicate]

I want to rename the columns from the second one to the last. Why do my commands not work?
fredTable <- structure(list(Symbol = structure(c(3L, 1L, 4L, 2L, 5L), .Label = c("CASACBM027SBOG",
"FRPACBW027SBOG", "TLAACBM027SBOG", "TOTBKCR", "USNIM"), class = "factor"),
Name = structure(1:5, .Label = c("bankAssets", "bankCash",
"bankCredWk", "bankFFRRPWk", "bankIntMargQtr"), class = "factor"),
Category = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Banks", class = "factor"),
Country = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "USA", class = "factor"),
Lead = structure(c(1L, 1L, 3L, 3L, 2L), .Label = c("Monthly",
"Quarterly", "Weekly"), class = "factor"), Freq = structure(c(2L,
1L, 3L, 3L, 4L), .Label = c("1947-01-01", "1973-01-01", "1973-01-03",
"1984-01-01"), class = "factor"), Start = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "Current", class = "factor"), End = c(TRUE,
TRUE, TRUE, TRUE, FALSE), SeasAdj = c(FALSE, FALSE, FALSE,
FALSE, TRUE), Percent = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Fed", class = "factor"),
Source = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Res", class = "factor"),
Series = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("Level",
"Ratio"), class = "factor")), .Names = c("Symbol", "Name",
"Category", "Country", "Lead", "Freq", "Start", "End", "SeasAdj",
"Percent", "Source", "Series"), row.names = c("1", "2", "3",
"4", "5"), class = "data.frame")
To rename the second through the last column I tried the following, but it does not work:
names(fredTable[,-1]) = paste("case", 1:ncol(fredTable[,-1]), sep = "")
or
names(fredTable)[,-1] = paste("case", 1:ncol(fredTable)[,-1], sep = "")
In general, how can one rename a specific range of columns (for example 2 to the end, or 2 to 7) and set the names to whatever one likes?

Replace specific column names by subsetting on the outside of the function, not within the names function as in your first attempt:
names(fredTable)[-1] <- paste("case", 1:ncol(fredTable[,-1]), sep = "")
Explanation
If we save the new names in a vector newnames we can investigate what is going on under the hood with replacement functions.
#These are the names that will replace the old names
newnames <- paste("case", 1:ncol(fredTable[,-1]), sep = "")
We should always replace specific column names with the format:
#The right way to replace the second name only
names(df)[2] <- "newvalue"
#The wrong way
names(df[2]) <- "newvalue"
The problem is that you are attempting to create a new vector of column names then assign the output to the data frame. These two operations are simultaneously completed in the correct replacement.
The right way [Internal]
We can expand the function call with:
#We enter this:
names(fredTable)[-1] <- newnames
#This is carried out on the inside
fredTable <- `names<-`(fredTable, value = `[<-`(names(fredTable), -1, value = newnames))
The wrong way [Internal]
The internals of replacement the wrong way are like this:
#Wrong way
names(fredTable[-1]) <- newnames
#Wrong way Internal
fredTable <- `[<-`(fredTable, -1, value = `names<-`(fredTable[-1], value = newnames))
Notice that `names<-` is applied only to a temporary copy, fredTable[-1]. That renamed copy is then written back into fredTable with `[<-`, which replaces the column values but keeps the existing column names, so the names of fredTable never change.
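The difference is easy to check on a small data frame. The sketch below uses a made-up three-column `df` rather than fredTable, and also shows renaming a range of columns, which covers the "2 to 7" style of request.

```r
df <- data.frame(id = 1:2, x = 3:4, y = 5:6)
newnames <- paste0("case", seq_len(ncol(df) - 1))

# Right: subset the names vector, then assign into it
right <- df
names(right)[-1] <- newnames

# Wrong: subset the data frame, then try to rename the subset
wrong <- df
names(wrong[-1]) <- newnames

# Renaming a specific range of columns works the same way
ranged <- df
names(ranged)[2:3] <- c("first", "second")
```

`names(right)` now starts with "id" followed by the new names, while `names(wrong)` is unchanged and no error or warning is raised, which is what makes the mistake easy to miss.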

Extracting data frames from a list based on column names in r

I am looking at extracting df's from within a list of multiple df's into separate data frames based on a condition (if the column names of a df within the list contains the name I am looking for).
For illustration purposes I have created an example which resembles the situation I am in.
I have list with multiple data frames and the dput of that list is given below:
structure(list(V1 = structure(list(lvef = c(0.965686195194885,
0.0806777632648268, -0.531729196500083, -0.511913109608259, -0.413670941196816,
-0.0501899795864357, -0.337583918771946, 1.16086745780346, -0.478358865835724,
-1.95009138673888), hbc = c(-0.389950511350405, -0.904388183933348,
0.811821977223064, -0.868381700124344, -0.637307418402866, -1.04703715824204,
-0.394340445217658, -0.194653869597247, 0.00822402232044511,
-0.145032587618231), id = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = "NA", class = "factor")), .Names = c("lvef",
"hbc", "id"), row.names = c(NA, -10L), class = "data.frame"),
V2 = structure(list(ersta = c(-0.254360310986174, 0.3859806928747,
-0.135741797055127, 1.03929145413636, -0.484219739337178,
0.255476285148917, 1.0479422937128, 0.146613094683722, -0.914377222535014,
1.75052418161618, -0.275059500684816, 2.34861397588234, 0.00183723766664941,
0.97612891408903, 0.278868537504227, 0.456979477254684, 1.46323739326792,
0.664511602217853, 0.870420202897545, 1.38228375734407),
pgrsta = c(-1.49129812271989, 0.820330747101906, -0.0469488167129374,
0.471549380446308, -1.71312120132398, 0.0578140025416816,
1.67016363826724, 0.226180835709491, -2.00294530465909,
-0.0464857361954717, 0.306942902768782, -0.785096914460742,
0.283822632249141, -0.260774679911329, -1.2865970194309,
0.307972619170242, 0.223715024597144, -1.01642533651475,
-0.12229427204957, 0.223326519096996), id = structure(c(7L,
7L, 7L, 7L, 4L, 1L, 3L, 5L, 6L, 2L, 7L, 7L, 7L, 7L, 4L,
1L, 3L, 5L, 6L, 2L), class = "factor", .Label = c("-0.10863576856322",
"-0.317324527228699", "-0.422764348315332", "0.285132258310185",
"1.23305496219042", "1.39326602279981", "NA"))), .Names = c("ersta",
"pgrsta", "id"), row.names = c(NA, -20L), class = "data.frame"),
V3 = structure(list(hormrec = 1:15, event = structure(c(10L,
10L, 10L, 10L, 10L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"NA"), class = "factor")), .Names = c("hormrec", "event"), row.names = c(NA,
-15L), class = "data.frame"), V4 = structure(list(asat = c(-0.321423784000631,
0.181345361079582, 0.389158724418319, -1.15251833725336,
-0.351981383678293, -0.506888212379408, 0.870705917350059,
-0.626883041051641, -0.321843006223371, -0.674564527029912,
-0.609383943267379, -0.181661119817784, -1.63676077872658
), lab = structure(c(1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 2L), .Label = c("btest", "NA", "rtest"), class = "factor")), .Names = c("asat",
"lab"), row.names = c(NA, -13L), class = "data.frame")), .Names = c("V1",
"V2", "V3", "V4"))
I am trying to extract data frames from the list based on the condition that if a data frame within the list contains the column name/s required then that data frame from the list should go into a separate data frame. So far, I have been able to extract the data frames into a list using the following code:
# function to extract required df's
trial <- function(x)
{
reqname <- c("hbc","ersta") # column names to check for
data <- x
lapply(seq(data), function(i){ # loop through all the data frames in the list
y <- data.frame(data[[i]]) # extract df in y
names <- names(y) # extract names of df
for(a in 1:length(reqname)) # loop through the length of reqname
{
if(reqname[a]%in%names) # check if column name/s present in current df
{
z <- y # extract df into another df
return(z) # return df
}
}
}
)
}
The above function returns a list of matching df's along with nulls where there was not a match. I am looking for a modification so that the selected data frame comes out separately. If there are two df's matching the requirement then the output should be two separate data frames.
I will appreciate all and any help in finding a solution.

You can use sapply() plus a custom function to identify the wanted data frames. For instance, if x is your list,
trial <- function(x)
{
reqnames <- c("hbc","ersta")
# TRUE for each data frame that has at least one of the required column names
keep <- sapply(x, function(d) any(names(d) %in% reqnames))
x[keep]
}
This outputs a list with only the dataframes containing at least one of the names in reqnames.

We can remove the NULL elements with Filter
lst1 <- Filter(length, trial(lst))
If we need multiple data.frame objects in the global environment, use list2env after renaming the list elements with the object names
names(lst1) <- paste0('dat', seq_along(lst1))
list2env(lst1, envir = .GlobalEnv)
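Here is a self-contained sketch of the filter-then-export idea on a tiny list. The list `lst`, the predicate `has_req` and the environment `env` are assumptions for illustration; a fresh environment is used instead of .GlobalEnv so the example does not clutter the workspace.

```r
# Hypothetical list of data frames with different columns
lst <- list(
  V1 = data.frame(lvef = 1, hbc = 2),
  V2 = data.frame(asat = 3, lab = "btest")
)
req <- c("hbc", "ersta")

# Predicate: does this data frame have any of the required columns?
has_req <- function(d) any(req %in% names(d))

# Filter keeps the elements for which the predicate is TRUE
matched <- Filter(has_req, lst)
names(matched) <- paste0("dat", seq_along(matched))

# Turn each list element into its own object inside env
env <- new.env()
list2env(matched, envir = env)
found <- ls(env)
```

Swapping `env` for .GlobalEnv would create the objects dat1, dat2, ... directly in the workspace.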

Converting this ugly for-loop to something more R-friendly

Been using SO as a resource constantly for my work. Thanks for holding together such a great community.
I'm trying to do something kinda complex, and the only way I can think to do it right now is with a pair of nested for-loops (I know that's frowned upon in R)... I have records of three million-odd course enrollments: student UserID's paired with CourseID's. In each row, there's a bunch of data including start/end dates and scores and so forth. What I need to do is, for each enrollment, calculate the average score for that user across the courses she's taken before the course in the enrollment.
The code I'm using for the for-loop follows:
data$Mean.Prior.Score <- 0
for (i in as.numeric(rownames(data)) {
sum <- 0
count <- 0
for (j in as.numeric(rownames(data[data$UserID == data$UserID[i],]))) {
if (data$Course.End.Date[j] < data$Course.Start.Date[i]) {
sum <- sum + data$Score[j]
count <- count + 1
}
}
if (count != 0)
data$Mean.Prior.Score[i] <- sum / count
}
I'm pretty sure this would work, but it runs incredibly slowly... my data frame has over three million rows, but after a good 10 minutes of chugging, the outer loop has only run through 850 of the records. That seems way slower than the time complexity would suggest, especially given that each user has only 5 or 6 courses to her name on average.
Oh, and I should mention that I converted the date strings with as.POSIXct() before running the loop, so the date comparison step shouldn't be too terribly slow...
There's got to be a better way to do this... any suggestions?
Edit: As per mnel's request... finally got dput to play nicely. Had to add control = NULL. Here 'tis:
structure(list(Username = structure(1:20, .Label = c("100225",
"100226", "100228", "1013170", "102876", "105796", "106753",
"106755", "108568", "109038", "110150", "110200", "110350", "111873",
"111935", "113579", "113670", "117562", "117869", "118329"), class = "factor"),
User.ID = c(2313737L, 2314278L, 2314920L, 9708829L, 2325896L,
2315617L, 2314644L, 2314977L, 2330148L, 2315081L, 2314145L,
2316213L, 2317734L, 2314363L, 2361187L, 2315374L, 2314250L,
2361507L, 2325592L, 2360182L), Course.ID = c(2106468L, 2106578L,
2106493L, 5426406L, 2115455L, 2107320L, 2110286L, 2110101L,
2118574L, 2106876L, 2110108L, 2110058L, 2109958L, 2108222L,
2127976L, 2106638L, 2107020L, 2127451L, 2117022L, 2126506L
), Course = structure(c(1L, 7L, 10L, 15L, 11L, 19L, 4L, 6L,
3L, 12L, 2L, 9L, 17L, 8L, 20L, 18L, 13L, 16L, 5L, 14L), .Label = c("ACCT212_A",
"BIOS200_N", "BIS220_T", "BUSN115_A", "BUSN115_T", "CARD205_A",
"CIS211_A", "CIS275_X", "CIS438_S", "ENGL112_A", "ENGL112_B",
"ENGL227_K", "GM400_A", "GM410_A", "HUMN232_M", "HUMN432_W",
"HUMN445_A", "MATH100_X", "MM575_A", "PSYC110_Y"), class = "factor"),
Course.Start.Date = structure(c(1098662400, 1098662400, 1098662400,
1309737600, 1099267200, 1098662400, 1099267200, 1099267200,
1098662400, 1098662400, 1099267200, 1099267200, 1099267200,
1098662400, 1104105600, 1098662400, 1098662400, 1104105600,
1098662400, 1104105600), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
Term.ID = c(12056L, 12056L, 12056L, 66282L, 12057L, 12056L,
12057L, 12057L, 12056L, 12056L, 12057L, 12057L, 12057L, 12056L,
13469L, 12056L, 12056L, 13469L, 12056L, 13469L), Term.Name = structure(c(2L,
2L, 2L, 4L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 3L, 2L,
2L, 3L, 2L, 3L), .Label = c("Fall 2004", "Fall 2004 Session A",
"Fall 2004 Session B", "Summer Session A 2011"), class = "factor"),
Term.Start.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-10-21",
"2004-10-28", "2004-12-27", "2011-06-26"), class = "factor"),
Score = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125,
0, 0, 0, 0, 0), First.Course.Date = structure(c(1L, 1L, 1L,
4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L,
1L, 3L), .Label = c("2004-10-25", "2004-11-01", "2004-12-27",
"2011-07-04"), class = "factor"), First.Term.Date = structure(c(1L,
1L, 1L, 4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L,
1L, 3L, 1L, 3L), .Label = c("2004-10-21", "2004-10-28", "2004-12-27",
"2011-06-26"), class = "factor"), First.Timer = c(TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), Course.Code = structure(c(1L,
6L, 9L, 13L, 9L, 17L, 4L, 5L, 3L, 10L, 2L, 8L, 15L, 7L, 18L,
16L, 11L, 14L, 4L, 12L), .Label = c("ACCT212", "BIOS200",
"BIS220", "BUSN115", "CARD205", "CIS211", "CIS275", "CIS438",
"ENGL112", "ENGL227", "GM400", "GM410", "HUMN232", "HUMN432",
"HUMN445", "MATH100", "MM575", "PSYC110"), class = "factor"),
Course.End.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-12-19",
"2005-02-27", "2005-03-26", "2011-08-28"), class = "factor")), .Names = c("Username",
"User.ID", "Course.ID", "Course", "Course.Start.Date", "Term.ID",
"Term.Name", "Term.Start.Date", "Score", "First.Course.Date",
"First.Term.Date", "First.Timer", "Course.Code", "Course.End.Date"
), row.names = c(NA, 20L), class = "data.frame")

I found that data.table worked well.
# Create some data.
library(data.table)
set.seed(1)
n=3e6
numCourses=5 # Average courses per student
data=data.table(UserID=as.character(round(runif(n,1,round(n/numCourses)))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
setkey(data,UserID)
# test=function(CourseEndDate,Score,CourseStartDate) sapply(CourseStartDate, function(y) mean(Score[y>CourseEndDate]))
# I vastly reduced the number of comparisons with a better "test" function.
test2=function(CourseEndDate,Score,CourseStartDate) {
o.end = order(CourseEndDate)
run.avg = cumsum(Score[o.end])/seq_along(CourseEndDate)
idx=findInterval(CourseStartDate,CourseEndDate[o.end])
idx=ifelse(idx==0,NA,idx)
run.avg[idx]
}
system.time(data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1)
# For three million courses, at an average of 5 courses per student:
# user system elapsed
# 122.06 0.22 122.45
Running a test to see if it looks the same as your code:
set.seed(1)
n=1e2
data=data.table(UserID=as.character(round(runif(n,1,1000))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
setkey(data,UserID)
data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1
data["246"]
# UserID course Score CourseStartDate CourseEndDate MeanPriorScore
#1: 246 54 0.4531314 2000-08-09 2000-09-20 0.9437248
#2: 246 89 0.9437248 2000-02-19 2000-03-02 NA
# A comparison with your for loop (slightly modified)
data$MeanPriorScore.old<-NA # Set to NA instead of zero for easy comparison.
# I think you forgot a bracket here. Also, There is no need to work with the rownames.
for (i in seq(nrow(data))) {
sum <- 0
count <- 0
# I reduced the complexity of figuring out the vector to loop through.
# It will result in the exact same thing if there are no rownames.
for (j in which(data$UserID == data$UserID[i])) {
if (data$CourseEndDate[j] <= data$CourseStartDate[i]) {
sum <- sum + data$Score[j]
count <- count + 1
}
}
# I had to add "[i]" here. I think that is what you meant.
if (count != 0) data$MeanPriorScore.old[i] <- sum / count
}
identical(data$MeanPriorScore,data$MeanPriorScore.old)
# [1] TRUE
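The trick inside test2 is a running average combined with findInterval. A minimal sketch for one student, using made-up day numbers instead of dates and assuming the courses are already sorted by end day:

```r
# Hypothetical data for one student, sorted by end day
end   <- c(10, 20, 30)      # day each course ended
score <- c(0.5, 1.0, 0.0)   # score for each course
start <- c(5, 15, 35)       # start days to look up

# Mean of the first k scores, for k = 1, 2, 3
run_avg <- cumsum(score) / seq_along(score)

# For each start day, how many courses had ended on or before it
idx <- findInterval(start, end)
idx[idx == 0] <- NA         # no earlier course: no prior mean

prior_mean <- run_avg[idx]
```

Note that findInterval counts a course ending on the start day itself as prior, which matches the <= used in the modified loop above rather than a strict <.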

This seems to be what you want
library(data.table)
# create a data.table object
DT <- data.table(data)
# key by userID
setkeyv(DT, 'userID')
# for each userID, where the Course.End.Date < Course.Start.Date
# return the mean score
# This is too simplistic
# DT[Course.End.Date < Course.Start.Date,
# list(Mean.Prior.Score = mean(Score)) ,
# by = list(userID)]
As per @joran's comment, this will be more complex than the code above.

This is only an outline of what I think a solution might entail. I'm going to use plyr just to illustrate the steps needed, for simplicity.
Let's just restrict ourselves to the case of one student. If we can calculate this for one student, extending it with some sort of split-apply will be trivial.
So let's suppose we have scores for a particular student, sorted by course end date:
d <- sample(seq(as.Date("2011-01-01"),as.Date("2011-01-31"),by = 1),100,replace = TRUE)
dat <- data.frame(date = sort(d),val = rnorm(100))
First, I think you'd need to summarise this by date and then calculate the cumulative running mean:
library(plyr)
dat_sum <- ddply(dat,.(date),summarise,valsum = sum(val),n = length(val))
dat_sum$mn <- with(dat_sum,cumsum(valsum) / cumsum(n))
Finally, you'd merge these values back into the original data with the duplicate dates:
dat_merge <- merge(dat,dat_sum[,c("date","mn")])
I could probably write something that does this in data.table using an anonymous function to do all those steps, but I suspect others may be better able to do something that will be concise and fast. (In particular, I don't recommend actually tackling this with plyr, as I suspect it will still be extremely slow.)

I think something like this should work though it'd be helpful to have test data with multiple courses per user. Also might need +1 on the start dates in findInterval to make condition be End.Date < Start.Date instead of <=.
# in the test data, one is POSIXct and the other a factor
data$Course.Start.Date = as.Date(data$Course.Start.Date)
data$Course.End.Date = as.Date(data$Course.End.Date)
data = data[order(data$Course.End.Date), ]
data$Mean.Prior.Score = ave(seq_along(data$User.ID), data$User.ID, FUN=function(i)
c(NA, cumsum(data$Score[i]) / seq_along(i))[1L + findInterval(data$Course.Start.Date[i], data$Course.End.Date[i])])
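For readers unfamiliar with ave, here is a simplified sketch of the group-wise pattern. It assumes rows are already in time order and that each user's courses do not overlap, so "prior" simply means "earlier rows of the same user"; the vectors `user` and `score` are made up.

```r
# Hypothetical enrollments, already sorted by time within each user
user  <- c("a", "a", "b", "b")
score <- c(1, 3, 2, 4)

# ave applies FUN to the scores of each user and puts the results
# back in the original row positions
prior <- ave(score, user, FUN = function(s) {
  running <- cumsum(s) / seq_along(s)   # mean of rows 1..k
  c(NA, head(running, -1))              # shift by one: mean of rows before k
})
```

The answer above does the same thing but uses findInterval to decide how many earlier courses count, which handles overlapping courses.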

With three million rows, a database may be helpful. Here is an SQLite example which I believe does something similar to your for loop:
# data.frame for testing
user <- sample.int(10000, 100)
course <- sample.int(10000, 100)
c_start <- sample(
seq(as.Date("2004-01-01"), by="3 months", length.out=12),
100, replace=TRUE
)
c_end <- c_start + as.difftime(11, units="weeks")
c_idx <- sample.int(100, 1000, replace=TRUE)
enroll <- data.frame(
user=sample(user, 1000, replace=TRUE),
course=course[c_idx],
c_start=as.character(c_start[c_idx]),
c_end=as.character(c_end[c_idx]),
score=runif(1000),
stringsAsFactors=FALSE
)
#variant 1: for-loop
system.time({
enroll$avg.p.score <- NA
for (i in 1:nrow(enroll)) {
sum <- 0
count <- 0
for (j in which(enroll$user==enroll$user[[i]]))
if (enroll$c_end[[j]] < enroll$c_start[[i]]) {
sum <- sum + enroll$score[[j]]
count <- count + 1
}
if(count !=0) enroll$avg.p.score[[i]] <- sum / count
}
})
#variant 2: sqlite
system.time({
library(RSQLite)
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "enroll", enroll, overwrite=TRUE)
sql <- paste("Select e.user, e.course, Avg(p.score)",
"from enroll as e",
"cross join enroll as p",
"where e.user=p.user and p.c_end < e.c_start",
"group by e.user, e.course;")
res <- dbSendQuery(con, sql)
dat <- fetch(res, n=-1)
})
On my machine, sqlite is ten times faster. If that is not enough, it would be possible to index the database.

I can't really test this, as your data doesn't appear to satisfy the inequality in any combination, but I'd try something like this:
library(plyr)
res <- ddply(data, .(User.ID), function(d) {
with(subset(merge(d, d, by=NULL, suffixes=c(".i", ".j")),
Course.End.Date.j < Course.Start.Date.i),
c(Mean.Prior.Score = mean(Score.j)))
})
res$Mean.Prior.Score[is.nan(res$Mean.Prior.Score)] = 0
Here is how it works:
ddply: Group data by User.ID and execute function for each subset d of rows for one User.ID
merge: Create two copies of the data for one user, one with columns suffixed .i the other .j
subset: From this outer join, only select those matching the given inequality
mean: Compute the mean for the matched rows
c(…): Give a name to the resulting column
res: Will be a data.frame with columns User.ID and Mean.Prior.Score
is.nan: mean will return NaN for zero-length vectors, change these to zeros
I guess this might be reasonably fast if there are not too many rows for each User.ID. If this isn't fast enough, the data.tables mentioned in other answers might help.
Your code is a bit fuzzy on the desired output: you treat data$Mean.Prior.Score like a length-one variable, but assign to it in every iteration of the loop. I assume that this assignment is meant only for one row. Do you need means for every row of the data frame, or only one mean per user?
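The merge and subset steps can be tried on their own in base R. This is a sketch for a single user with made-up columns `start`, `end` and `score`, and it swaps in aggregate for the final mean instead of plyr.

```r
# Hypothetical enrollments for one user
d <- data.frame(start = c(1, 5, 9), end = c(3, 7, 11),
                score = c(0.2, 0.6, 1.0))

# by = NULL pairs every row with every row (3 x 3 = 9 pairs);
# .i marks the current course, .j the candidate prior course
pairs <- merge(d, d, by = NULL, suffixes = c(".i", ".j"))

# Keep only pairs where course j ended before course i started
prior <- subset(pairs, end.j < start.i)

# Mean prior score for each current course
res <- aggregate(score.j ~ start.i, data = prior, FUN = mean)
```

The first course (start = 1) has no earlier course, so it has no rows in `prior` and does not appear in `res`; such rows would need to be filled in afterwards if a value is required for every enrollment.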
