How to select columns in dataframe using variables - r

I have a dataframe (data in the testdf dataframe included for replication) with 41 columns, and I want to rearrange its columns in a specific order. The original dataset can contain a variable number of columns, because the number of lags of the variables used in the time series regression varies. The dataset is then manipulated to add as many columns as required to bring the total number of lags of the variables up to a fixed maximum (7 in this case). Hence, in this dataset, there are 6 usage lag variables and 1 has been added (it appears at the end).
The dataframe is shown below:
region,regionalentity,entity,entitycode,dateperiod,dateperiodmax,avgtemp,avgtempcategory,adjavgtemp,adjavgtemptype,dayname,daynum,usage.lag0,intercept,usage.lag-1,usage.lag-2,usage.lag-3,usage.lag-4,usage.lag-5,usage.lag-6,temp.lag0,ciresL1,modelcomputedfittedvalues,usagelevel_L0,usagelevel_L1,fittedlevelusage,fittedlevelusagevar0,fittedlevelusagevarpct,fittedlevelusageabsvar0,fittedlevelusagevarabspct,adjfittedlevelusage,adjfittedlevelusagevar0,adjfittedlevelusagevarpct,adjfittedlevelusageabsvar0,adjfittedlevelusagevarabspct,adjfac,minvarpct,maxvarpct,adjminvarpct,adjmaxvarpct,usage_lag_7
VIC,VIC,VIC_TRU,6,2018-08-08,2019-08-26,12.8,Actual,12.8,Plus_0,Wed,3,-1978.630477847,1,5217.164445177,38381.403272784,-26573.993165182,-3571.086713581,-48.301188955,-865.165969976,1.75,37767.5832575731,-10546.6865414192,154724.804766449,156703.435244296,146156.748702877,8568.05606357215,5.5376098722537,8568.05606357215,5.5376098722537,146156.748702877,8568.05606357215,5.5376098722537,8568.05606357215,5.5376098722537,1,5,10,5,10,9999
VIC,VIC,VIC_TRU,6,2018-08-09,2019-08-26,10.5,Actual,10.5,Plus_0,Thu,4,6219.068623674,1,-1978.630477847,5217.164445177,38381.403272784,-26573.993165182,-3571.086713581,-48.301188955,-2.3,49140.1804479394,-4574.58388500458,160943.873390123,154724.804766449,150150.220881444,10793.6525086786,6.70646995211497,10793.6525086786,6.70646995211497,150150.220881444,10793.6525086786,6.70646995211497,10793.6525086786,6.70646995211497,1,5,10,5,10,9999
VIC,VIC,VIC_TRU,6,2018-08-10,2019-08-26,14.7,Actual,14.7,Plus_0,Fri,5,-47279.361890857,1,6219.068623674,-1978.630477847,5217.164445177,38381.403272784,-26573.993165182,-3571.086713581,4.2,37811.9212791045,-15456.0153096346,113664.511499266,160943.873390123,145487.858080488,-31823.3465812224,-27.9976099500748,31823.3465812224,27.9976099500748,145487.858080488,-31823.3465812224,-27.9976099500748,31823.3465812224,27.9976099500748,1,25,30,25,30,9999
VIC,VIC,VIC_TRU,6,2018-08-11,2019-08-26,11.4,Actual,11.4,Plus_0,Sat,6,34609.477278232,1,-47279.361890857,6219.068623674,-1978.630477847,5217.164445177,38381.403272784,-26573.993165182,-3.3,22575.5057919593,5349.94126323161,148273.988777498,113664.511499266,119014.452762498,29259.5360150004,19.7334247606353,29259.5360150004,19.7334247606353,119014.452762498,29259.5360150004,19.7334247606353,29259.5360150004,19.7334247606353,1,15,20,15,20,9999
VIC,VIC,VIC_TRU,6,2018-08-12,2019-08-26,10.1,Actual,10.1,Plus_0,Sun,7,181.193161194,1,34609.477278232,-47279.361890857,6219.068623674,-1978.630477847,5217.164445177,38381.403272784,-1.3,32008.3823244177,601.787165323653,148455.181938692,148273.988777498,148875.775942822,-420.594004129642,-0.283313791163811,420.594004129642,0.283313791163811,148875.775942822,-420.594004129642,-0.283313791163811,420.594004129642,0.283313791163811,1,0,5,0,5,9999
VIC,VIC,VIC_TRU,6,2018-08-13,2019-08-26,11.4,Actual,11.4,Plus_0,Mon,1,-11354.297567614,1,181.193161194,34609.477278232,-47279.361890857,6219.068623674,-1978.630477847,5217.164445177,1.3,22271.5206463676,-4603.22595618364,137100.884371078,148455.181938692,143851.955982508,-6751.07161143035,-4.92416343074627,6751.07161143035,4.92416343074627,143851.955982508,-6751.07161143035,-4.92416343074627,6751.07161143035,4.92416343074627,1,0,5,0,5,9999
VIC,VIC,VIC_TRU,6,2018-08-14,2019-08-26,13.05,Actual,13.05,Plus_0,Tue,2,-17233.144436292,1,-11354.297567614,181.193161194,34609.477278232,-47279.361890857,6219.068623674,-1978.630477847,1.65,20835.2779179977,-6148.02463599273,119867.739934786,137100.884371078,130952.859735085,-11085.1198002992,-9.24779244718395,11085.1198002992,9.24779244718395,130952.859735085,-11085.1198002992,-9.24779244718395,11085.1198002992,9.24779244718395,1,5,10,5,10,9999
VIC,VIC,VIC_TRU,6,2018-08-15,2019-08-26,14.95,Actual,14.95,Plus_0,Wed,3,12026.579924003,1,-17233.144436292,-11354.297567614,181.193161194,34609.477278232,-47279.361890857,6219.068623674,1.9,16190.4338545925,-1828.25322565582,131894.319858789,119867.739934786,118039.48670913,13854.8331496588,10.5044956935919,13854.8331496588,10.5044956935919,118039.48670913,13854.8331496588,10.5044956935919,13854.8331496588,10.5044956935919,1,10,15,10,15,9999
VIC,VIC,VIC_TRU,6,2018-08-16,2019-08-26,11.7,Actual,11.7,Plus_0,Thu,4,12449.922399102,1,12026.579924003,-17233.144436292,-11354.297567614,181.193161194,34609.477278232,-47279.361890857,-3.25,42712.6323897985,4460.21257151099,144344.242257891,131894.319858789,136354.5324303,7989.709827591,5.53517736669833,7989.709827591,5.53517736669833,136354.5324303,7989.709827591,5.53517736669833,7989.709827591,5.53517736669833,1,5,10,5,10,9999
VIC,VIC,VIC_TRU,6,2018-08-17,2019-08-26,11.8,Actual,11.8,Plus_0,Fri,5,-9762.010530065,1,12449.922399102,12026.579924003,-17233.144436292,-11354.297567614,181.193161194,34609.477278232,0.1,30367.4176907901,-4864.57340900852,134582.231727826,144344.242257891,139479.668848882,-4897.43712105646,-3.63899235298821,4897.43712105646,3.63899235298821,139479.668848882,-4897.43712105646,-3.63899235298821,4897.43712105646,3.63899235298821,1,0,5,0,5,9999
VIC,VIC,VIC_TRU,6,2018-08-18,2019-08-26,11.35,Actual,11.35,Plus_0,Sat,6,22305.952959846,1,-9762.010530065,12449.922399102,12026.579924003,-17233.144436292,-11354.297567614,181.193161194,-0.45,21368.3344560516,-1971.88265162283,156888.184687672,134582.231727826,132610.349076203,24277.8356114689,15.4746105704521,24277.8356114689,15.4746105704521,132610.349076203,24277.8356114689,15.4746105704521,24277.8356114689,15.4746105704521,1,15,20,15,20,9999
VIC,VIC,VIC_TRU,6,2018-08-19,2019-08-26,9.3,Actual,9.3,Plus_0,Sun,7,27244.1359885,1,22305.952959846,-9762.010530065,12449.922399102,12026.579924003,-17233.144436292,-11354.297567614,-2.05,40241.1145869285,-1906.75491557666,184132.320676172,156888.184687672,154981.429772095,29150.8909040767,15.8314905265021,29150.8909040767,15.8314905265021,154981.429772095,29150.8909040767,15.8314905265021,29150.8909040767,15.8314905265021,1,15,20,15,20,9999
VIC,VIC,VIC_TRU,6,2018-08-20,2019-08-26,8.95,Actual,8.95,Plus_0,Mon,1,-6677.68231343,1,27244.1359885,22305.952959846,-9762.010530065,12449.922399102,12026.579924003,-17233.144436292,-0.35,51845.2410212359,-11603.7793657006,177454.638362742,184132.320676172,172528.541310471,4926.0970522706,2.77597536909741,4926.0970522706,2.77597536909741,172528.541310471,4926.0970522706,2.77597536909741,4926.0970522706,2.77597536909741,1,0,5,0,5,9999
VIC,VIC,VIC_TRU,6,2018-08-21,2019-08-26,10.5,Actual,10.5,Plus_0,Tue,2,-14638.358924711,1,-6677.68231343,27244.1359885,22305.952959846,-9762.010530065,12449.922399102,12026.579924003,1.55,42497.3131741632,-16306.3388719971,162816.279438031,177454.638362742,161148.299490745,1667.97994728605,1.02445526518796,1667.97994728605,1.02445526518796,161148.299490745,1667.97994728605,1.02445526518796,1667.97994728605,1.02445526518796,1,0,5,0,5,9999
The variables are therefore defined as follows:
maxusagelag <- 6          # can change depending on the number of lags in the selected time series regression model
maxpossibleusagelags <- 7
I want the data frame to be arranged as follows:
1. The first 14 columns as they are.
2. The next 6 columns as they are, since maxusagelag = 6 (there are 6 lagged usage variables in the original dataset).
3. Then the last variable (named "usage_lag_7"), 1 column in this case because maxpossibleusagelags - maxusagelag = 1.
4. Then all of the remaining columns, excluding the last one, which was already moved in step 3.
I tried every option I could think of, but nothing worked. Here are some of my attempts:
val1 <- ((ncol(testdf) - (maxpossibleusagelags - maxusagelag) + 1):ncol(testdf))
val1 : 41
val2 <- ((15 + maxpossibleusagelags - 1):(ncol(testdf) - (maxpossibleusagelags - maxusagelag)))
val2 : 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
val3 <- ((ncol(testdf) - (maxpossibleusagelags - maxusagelag)))
val3 : 40
paste0(val2, ":", val3)
[1] "21:40"
testdf1 <- testdf[, paste0("c(", val2, ":", val3, ")")]
testdf1 <- dplyr::select(testdf, paste0("c(", val2, ":", val3, ")")])
Error: unexpected ']' in "testdf1 <- dplyr::select(testdf, paste0("c(", val2, ":", val3, ")")]"
Is there a way to select columns in a dataframe by position using variables?

We could use grep to build a regular expression that selects the lag columns dynamically, and setdiff to append the remaining columns at the end.
first_set <- 1:14
# read.csv sanitizes names such as "usage.lag-1" to "usage.lag.1", hence the literal dot
second_set <- grep(paste0("lag\\.[1-", maxusagelag, "]"), names(testdf))
combine_set <- c(first_set, second_set, ncol(testdf))
and now use it to subset the columns:
testdf[, c(combine_set, setdiff(seq_along(testdf), combine_set))]
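As an aside on why the attempts in the question failed: paste0("c(", val2, ":", val3, ")") produces a character string such as "c(21:40)", not a numeric index vector, so the subset looks for a column with that literal name. Numeric vectors can be passed to [ directly. A minimal purely positional sketch (the dummy 41-column dataframe is a stand-in for the real testdf; block positions follow the layout described in the question):

```r
maxusagelag <- 6
maxpossibleusagelags <- 7

# stand-in for the real data: any 41-column dataframe illustrates the reordering
testdf <- setNames(as.data.frame(matrix(0, 1, 41)), paste0("col", 1:41))

n      <- ncol(testdf)                                      # 41
fixed  <- 1:14                                              # block 1: first 14 columns
lags   <- 15:(14 + maxusagelag)                             # block 2: the 6 usage lags
padded <- (n - (maxpossibleusagelags - maxusagelag) + 1):n  # block 3: usage_lag_7
rest   <- setdiff(seq_len(n), c(fixed, lags, padded))       # block 4: everything else

testdf1 <- testdf[, c(fixed, lags, padded, rest)]
```

Because the blocks are computed from maxusagelag and maxpossibleusagelags, the same few lines adapt automatically when the number of lags in the selected model changes.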


Is there an R function equivalent to Excel's $ for "keep reference cell constant" [duplicate]

I'm new to R and I've done my best googling for the answer to the question below, but nothing has come up so far.
In Excel you can keep a specific column or row constant when using a reference by putting $ before the row number or column letter. This is handy when performing operations across many cells when all cells are referring to something in a single other cell. For example, take a dataset with grades in a course: Row 1 has the total number of points per class assignment (each column is an assignment), and Rows 2:31 are the raw scores for each of 30 students. In Excel, to calculate percentage correct, I take each student's score for that assignment and refer it to the first row, holding row constant in the reference so I can drag down and apply that operation to all 30 rows below Row 1. Most importantly, in Excel I can also drag right to do this across all columns, without having to type a new operation.
What is the most efficient way to perform this operation--holding a reference row constant while performing an operation to all other rows, then applying this across columns while still holding the reference row constant--in R? So far I had to slice the reference row to a new dataframe, remove that row from the original dataframe, then type one operation per column while manually going back to the new dataframe to look up the reference number to apply for that column's operation. See my super-tedious code below.
For reference, each column is an assignment, and Row 1 had the number of points possible for that assignment. All subsequent rows were individual students and their grades.
# Extract number of points possible
outof <- slice(grades, 1)
# Now remove that row (Row 1)
grades <- grades[-c(1),]
# Turn number correct into percentage. The divided by
# number is from the sliced Row 1, which I had to
# look up and type one-by-one. I'm hoping there is
# code to do this automatically in R.
grades$ExamFinal <- (grades$ExamFinal / 34) * 100
grades$Exam3 <- (grades$Exam3 / 26) * 100
grades$Exam4 <- (grades$Exam4 / 31) * 100
grades$q1.1 <- grades$q1.1 / 6
grades$q1.2 <- grades$q1.2 / 10
grades$q1.3 <- grades$q1.3 / 6
grades$q2.2 <- grades$q2.2 / 3
grades$q2.4 <- grades$q2.4 / 12
grades$q3.1 <- grades$q3.1 / 9
grades$q3.2 <- grades$q3.2 / 8
grades$q3.3 <- grades$q3.3 / 12
grades$q4.1 <- grades$q4.1 / 13
grades$q4.2 <- grades$q4.2 / 5
grades$q6.1 <- grades$q6.1 / 5
grades$q6.2 <- grades$q6.2 / 6
grades$q6.3 <- grades$q6.3 / 11
grades$q7.1 <- grades$q7.1 / 7
grades$q7.2 <- grades$q7.2 / 8
grades$q8.1 <- grades$q8.1 / 7
grades$q8.3 <- grades$q8.3 / 13
grades$q9.2 <- grades$q9.2 / 13
grades$q10.1 <- grades$q10.1 / 8
grades$q12.1 <- grades$q12.1 / 12
You can use sweep:
100*sweep(grades, 2, outof, "/")
# ExamFinal EXam3 EXam4
#1 100.00 76.92 32.26
#2 88.24 84.62 64.52
#3 29.41 100.00 96.77
Data:
grades
ExamFinal EXam3 EXam4
1 34 20 10
2 30 22 20
3 10 26 30
outof
[1] 34 26 31
grades <- data.frame(ExamFinal=c(34,30,10),
EXam3=c(20,22,26),
EXam4=c(10,20,30))
outof <- c(34,26,31)
You can use mapply on the original grades dataframe (don't remove the first row) to divide rows by the first row. Then convert the result back to a dataframe.
as.data.frame(mapply("/", grades[2:31, ], grades[1, ]))
The easiest way is to use some type of loop; here I use the sapply function to divide all of the elements in each column by the corresponding total score.
#Example data
outof<-data.frame(q1=c(3), q2=c(5))
grades<-data.frame(q1=c(1,2,3), q2=c(4,4, 5))
answermatrix <- sapply(1:ncol(grades), function(i) {
  # grades[, i] / outof[i]   # use this if "outof" is a vector
  grades[, i] / outof[, i]
})
answermatrix
A loop would probably be your best bet.
First extract the highest possible score, as listed in the first row, then use that number to calculate the percentage in the remaining rows of each column:
for (i in 1:ncol(df)) {
  a <- df[1, i]                 # total points possible for this assignment
  j <- 2                        # first student row (row 1 holds the totals)
  while (j <= nrow(df)) {
    b <- df[j, i]               # this student's raw score
    df[j, i] <- (b / a) * 100   # replace the raw score with a percentage
    j <- j + 1                  # move to the next row
  }
}
The only drawback to this approach is that dataframes modified inside functions aren't copied to the global environment, but that can be fixed by introducing a function like so:
f1 <- function(x, y) {  # x = the dataframe; y = the name you want the completed dataframe to have
  for (i in 1:ncol(x)) {
    a <- x[1, i]                 # total points possible for this column
    j <- 2
    while (j <= nrow(x)) {
      x[j, i] <- (x[j, i] / a) * 100
      j <- j + 1
    }
  }
  arg_name <- deparse(substitute(y))       # gets the argument name
  assign(arg_name, x, envir = .GlobalEnv)  # produces the global dataframe
}

How to arithmetically manipulate values in a single column in a data table?

Within a data.table is a column containing factors that I would like to manipulate arithmetically. I would like to sum the three values on the left side of each ratio, and sum the three numbers on the right side of the ratio, then return that summed value as a ratio. It's tricky to explain, but if I have this as part of a data table:
FattyAcid
1 4:0/16:0/16:0
2 16:0/16:0/18:1
3 18:1/14:0/18:1
I would then like to return in the data table
FattyAcid Assignment
1 4:0/16:0/16:0 36:0
2 16:0/16:0/18:1 50:1
3 18:1/14:0/18:1 50:2
i.e. for entry 1, (4 + 16 + 16):(0 + 0 + 0) = 36:0
When I call the dataset in the str function, it shows that the relevant column is: "Factor w/ 179 levels "(10:0/10:0/12:0)",..: 112 104 114 33 61 115 106 30 60 66 ..."
EDIT: I've found a solution, but it's not elegant. Basically I have to separate the values using tstrsplit() and paste them into new columns, which eventually generates six columns. Then convert them into numeric (from character), combine the relevant columns and then combine that result again. Then I just delete the old columns. I'm sure there's a better way but I guess it works :)
### split up the fatty acid factors into three columns separated by "/" i.e. individual ID'd fatty acids.
### also remove the starting and trailing brackets
setDT(LipidDataShortest)[, paste0("FattyAcid", 1:3) := tstrsplit(FattyAcid, "/")]
LipidDataShortest <- as.data.table(sapply(LipidDataShortest, gsub, pattern="[(]", replacement = ""))
LipidDataShortest <- as.data.table(sapply(LipidDataShortest, gsub, pattern="[)]", replacement = ""))
### small issue - also removes bracket from "FattyAcid" column. Way to remove only from specific columns?
### split up the specific fatty acids into number of carbons and number of double bonds
setDT(LipidDataShortest)[, paste0("FattyAcidOne", 1:2) := tstrsplit(FattyAcid1, ":")]
setDT(LipidDataShortest)[, paste0("FattyAcidTwo", 1:2) := tstrsplit(FattyAcid2, ":")]
setDT(LipidDataShortest)[, paste0("FattyAcidThree", 1:2) := tstrsplit(FattyAcid3, ":")]
### convert from character to numeric
LipidDataShortest$FattyAcidOne1 <- as.numeric(LipidDataShortest$FattyAcidOne1)
LipidDataShortest$FattyAcidOne2 <- as.numeric(LipidDataShortest$FattyAcidOne2)
LipidDataShortest$FattyAcidTwo1 <- as.numeric(LipidDataShortest$FattyAcidTwo1)
LipidDataShortest$FattyAcidTwo2 <- as.numeric(LipidDataShortest$FattyAcidTwo2)
LipidDataShortest$FattyAcidThree1 <- as.numeric(LipidDataShortest$FattyAcidThree1)
LipidDataShortest$FattyAcidThree2 <- as.numeric(LipidDataShortest$FattyAcidThree2)
### combine the columns to get total carbons and create new column for that, then repeat for alkenes
setDT(LipidDataShortest)[, paste0("Carbons", 1) := LipidDataShortest$FattyAcidOne1 + LipidDataShortest$FattyAcidTwo1 + LipidDataShortest$FattyAcidThree1 ]
setDT(LipidDataShortest)[, paste0("DoubleBonds", 1) := LipidDataShortest$FattyAcidOne2 + LipidDataShortest$FattyAcidTwo2 + LipidDataShortest$FattyAcidThree2 ]
### combine final assignments into new column and delete the unnecessary columns used to get to this point
LipidDataShortest$Assignment <- paste(LipidDataShortest$Carbons1, LipidDataShortest$DoubleBonds1, sep = ":")
LipidDataShortest <- LipidDataShortest[, -c(10:20)]
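For what it's worth, the split-convert-sum round trip can be collapsed into a single per-row computation, avoiding the intermediate columns entirely. This is only a sketch assuming the bracketed "(4:0/16:0/16:0)"-style strings shown in the str() output; the example vector below stands in for the FattyAcid column:

```r
# example values in the format shown by str() above
fattyacids <- c("(4:0/16:0/16:0)", "(16:0/16:0/18:1)", "(18:1/14:0/18:1)")

sum_ratio <- function(s) {
  # strip brackets, split into the three "carbons:doublebonds" pairs, then into numbers
  pairs <- strsplit(strsplit(gsub("[()]", "", s), "/")[[1]], ":")
  m <- sapply(pairs, as.numeric)   # 2 x 3 matrix: row 1 = carbons, row 2 = double bonds
  paste(sum(m[1, ]), sum(m[2, ]), sep = ":")
}

assignment <- sapply(as.character(fattyacids), sum_ratio, USE.NAMES = FALSE)
# assignment is "36:0" "50:1" "50:2"
```

With the data.table from the question this would become LipidDataShortest[, Assignment := sapply(as.character(FattyAcid), sum_ratio)], with no helper columns to delete afterwards.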

Problems subsetting columns based on values from two separate dataframes

I am using data obtained from a spatially gridded system, for example a city divided up into equally spaced squares (e.g. 250m2 cells). Each cell possesses a unique column and row number with corresponding numerical information about the area contained within this 250m2 square (say temperature for each cell across an entire city). Within the entire gridded section (or the example city), I have various study sites and I know where they are located (i.e. which cell row and column each site is located within). I have a dataframe containing information on all cells within the city, but I want to subset this to only contain information from the cells where my study sites are located. I previously asked a question on this 'Matching information from different dataframes and filtering out redundant columns'. Here is some example code again:
###Dataframe showing cell values for my own study sites
Site <- as.data.frame(c("Site.A","Site.B","Site.C"))
Row <- as.data.frame(c(1,2,3))
Column <- as.data.frame(c(5,4,3))
df1 <- cbind(Site,Row, Column)
colnames(df1) <- c("Site","Row","Column")
###Dataframe showing information from ALL cells
eg1 <- rbind(c(1,2,3,4,5),c(5,4,3,2,1)) ##Cell rows and columns
eg2 <- as.data.frame(matrix(sample(0:50, 15*10, replace=TRUE), ncol=5)) ##Numerical information
df2 <- rbind(eg1,eg2)
rownames(df2)[1:2] <- c("Row","Column")
From this, I used the answer from the previous questions which worked perfectly for the example data.
output <- df2[, (df2['Row', ] %in% df1$Row) & (df2['Column', ] %in% df1$Column)]
names(output) <- df1$Site[mapply(function(r, c){which(r == df1$Row & c == df1$Column)}, output[1,], output[2,])]
However, I cannot apply this to my own data and cannot figure out why.
EDIT: Initially, I thought there was a problem with naming the columns (i.e. the 'names' function). But it would appear there may be an issue with the 'output' line of code, whereby columns are being included from df2 that shouldn't be (i.e. the output contained columns from df2 which possessed column and row numbers not specified within df1).
I have also tried:
output <- df2[, (df2['Row', ] == df1$Row) & (df2['Column', ] == df1$Column)]
But when using my own (seemingly comparable) data, I don't get information from all cells specified in the 'df1' equivalent (although again works fine in the example data above). I can get my own data to work if I do each study site individually.
SiteA <- df2[, which(df2['Row', ] == 1) & (df2['Column', ] == 5)]
SiteB <- df2[, which(df2['Row', ] == 2) & (df2['Column', ] == 4)]
SiteC <- df2[, which(df2['Row', ] == 3) & (df2['Column', ] == 3)]
But I have 1000s of sites and was hoping for a more succinct way. I am sure that I have maintained the same structure, double checked spellings and variable names. Would anyone be able to shed any light on potential things which I could be doing wrong? Or failing this an alternative method?
Apologies for not providing an example code for the actual problem (I wish I could pinpoint what the specific problem is, but until then the original example is the best I can do)! Thank you.
The only apparent issue I can see is that mapply is not wrapped in unlist. Here mapply can return a list (if any which() call comes back empty, the result cannot be simplified to a vector), which is not what you're after for subsetting purposes. So, try:
output <- df2[, (df2['Row', ] %in% df1$Row) & (df2['Column', ] %in% df1$Column)]
names(output) <- df1$Site[unlist(mapply(function(r, c){which(r == df1$Row & c == df1$Column)}, output[1,], output[2,]))]
Edit:
If the goal is to grab columns whose first 2 rows match the 2nd and 3rd elements of a given row in df1, you can try the following:
output_df <- Filter(function(x) !all(is.na(x)),
  data.frame(do.call(cbind, apply(df2, 2, function(x) {
    ## Create a condition vector for an if-statement or for subsetting
    condition <- paste0(x[1:2], collapse = "") ==
      apply(df1[, c('Row', 'Column')], 1, function(y) {
        paste0(y, collapse = "")
      })
    ## Return a column if it meets the condition (first 2 rows are matched in df1)
    if (sum(condition) != 0) {
      tempdf <- data.frame(x)
      names(tempdf) <- df1[condition, ]$Site[1]
      tempdf
    } else {
      ## If they are not matched, then return an empty column
      data.frame(rep(NA, nrow(df2)))
    }
  }))))
It is quite a condensed piece of code, so I hope the following explanation will help clarify some things:
This basically goes through every column in df2 (with apply(df2, 2, FUN)) and checks if its first 2 rows can be found in the 2nd and 3rd elements of every row in df1. If the condition is met, then it returns that column in a data.frame format with its column name being the value of Site in the matching row in df1; otherwise an empty column (with NA's) is returned. These columns are then bound together with do.call and cbind, and then coerced into a data.frame. Finally, we use the Filter function to remove columns whose values are NA's.
All that should give the following:
Site.A Site.B Site.C
1 2 3
5 4 3
40 42 33
13 47 25
23 0 34
2 41 17
10 29 38
43 27 8
31 1 25
31 40 31
34 12 43
43 30 46
46 49 25
45 7 17
2 13 38
28 12 12
16 19 15
39 28 30
41 24 30
10 20 42
11 4 8
33 40 41
34 26 48
2 29 13
38 0 27
38 34 13
30 29 28
47 2 49
22 10 49
45 37 30
29 31 4
25 24 31
I hope this helps.
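As a note on why the original %in% approach misbehaves on real data: it matches rows and columns independently, so a cell whose row number comes from one site and whose column number comes from another still passes. Matching each row/column pair as a single unit avoids this. A minimal sketch (the separator in the paste guards against ambiguous concatenations such as 1,15 vs 11,5; df1/df2 are built exactly as in the question):

```r
# data as constructed in the question
Site <- as.data.frame(c("Site.A", "Site.B", "Site.C"))
Row <- as.data.frame(c(1, 2, 3))
Column <- as.data.frame(c(5, 4, 3))
df1 <- cbind(Site, Row, Column)
colnames(df1) <- c("Site", "Row", "Column")
eg1 <- rbind(c(1, 2, 3, 4, 5), c(5, 4, 3, 2, 1))
eg2 <- as.data.frame(matrix(sample(0:50, 15 * 10, replace = TRUE), ncol = 5))
df2 <- rbind(eg1, eg2)
rownames(df2)[1:2] <- c("Row", "Column")

# one composite key per site and per cell, so row and column are matched as a pair
site_key <- paste(df1$Row, df1$Column, sep = "_")
cell_key <- paste(unlist(df2["Row", ]), unlist(df2["Column", ]), sep = "_")

output <- df2[, match(site_key, cell_key)]
names(output) <- df1$Site
```

match also preserves the order of df1, so the output columns line up with the Site names without the mapply step.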

How to create ID (by) before merging in R?

I have two dataframes df.o and df.m as defined below. I need to find which observation in df.o (dimension table) corresponds which observations in df.m (fact table) based on two criteria: 1) df.o$Var1==df.o$Var1 and df.o$date1 < df.m$date2 < df.o$date3 such that I get the correct value of df.o$oID in df.m$oID (the correct value is manually entered in df.m$CORRECToID). I need the ID to complete a merge afterwards.
df.o <- data.frame(oID=1:4,
Var1=c("a","a","b","c"),
date3=c(2015,2011,2014,2015),
date1=c(2013,2009,2012,2013),
stringsAsFactors=FALSE)
df.m <- data.frame(mID=1:3,
Var1=c("a","a","b"),
date2=c(2014,2010,2013),
oID=NA,
CORRECToID=c(1,2,3),
points=c(5, 10,15),
stringsAsFactors=FALSE)
I have tried various combinations of like the code below, but without luck:
df.m$oID[df.m$date2 < df.o$date3 & df.m$date2 > df.o$date1 & df.o$Var1==df.m$Var1] <- df.o$oID
I have also tried experimenting with various combinations of ifelse, which and match, but none seem to do the trick.
The problem I keep encountering is that my replacement was a different number of rows than data and that "longer object length is not a multiple of shorter object length".
What you are looking for is called an "overlap join", you could try the data.table::foverlaps function in order to achieve this.
The idea is simple:
Create the columns to overlap on (add an additional column to df.m)
key by these columns
run foverlaps and select the column you want back
library(data.table)
setkey(setDT(df.m)[, date4 := date2], Var1, date2, date4)
setkey(setDT(df.o), Var1, date1, date3)
foverlaps(df.m, df.o)[, names(df.m), with = FALSE]
# mID Var1 date2 oID CORRECToID points date4
# 1: 2 a 2010 2 2 10 2010
# 2: 1 a 2014 1 1 5 2014
# 3: 3 b 2013 3 3 15 2013
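For completeness, data.table (1.9.8 and later) can also express the same interval condition as a non-equi update join, which avoids creating the helper date4 column; this is a sketch of that alternative, using the df.m/df.o definitions from the question:

```r
library(data.table)

df.o <- data.frame(oID = 1:4, Var1 = c("a", "a", "b", "c"),
                   date3 = c(2015, 2011, 2014, 2015),
                   date1 = c(2013, 2009, 2012, 2013),
                   stringsAsFactors = FALSE)
df.m <- data.frame(mID = 1:3, Var1 = c("a", "a", "b"),
                   date2 = c(2014, 2010, 2013),
                   oID = NA, CORRECToID = c(1, 2, 3), points = c(5, 10, 15),
                   stringsAsFactors = FALSE)
setDT(df.m); setDT(df.o)

# for each df.o interval, stamp its oID onto the df.m rows with matching Var1
# and date1 < date2 < date3
df.m[df.o, oID := i.oID, on = .(Var1, date2 > date1, date2 < date3)]
```

In the on = clause each condition reads left-to-right as df.m column versus df.o column (date2 > date1 means df.m$date2 > df.o$date1), so the join condition is exactly the date1 < date2 < date3 requirement, and the filled oID matches CORRECToID.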

lapply on single column in data frame

I have a data frame which I populate from a csv file as follows (data for sample only) :
> csv_data <- read.csv('test.csv')
> csv_data
gender country income
1 1 20 10000
2 2 20 12000
3 2 23 3000
I want to convert country to factor. However when I do the following, it fails :
> csv_data[,2] <- lapply(csv_data[,2], factor)
Warning message:
In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
provided 3 variables to replace 1 variables
However, if I convert both gender and country to factor, it succeeds :
> csv_data[,1:2] <- lapply(csv_data[,1:2], factor)
> is.factor(csv_data[,1])
[1] TRUE
> is.factor(csv_data[,2])
[1] TRUE
Is there something I am doing wrong? I want to use lapply since I want to programmatically convert the columns into factors and it could be possible that the number of columns to be converted is only 1(it could be more as well, this number is driven from arguments to a function). Any way I can do it using lapply only?
When subsetting for one single column, you'll need to change it slightly.
There's a big difference between
lapply(df[,2], factor)
and
lapply(df[2], factor)
## and/or
lapply(df[, 2, drop=FALSE], factor)
Have a look at the output of each. If you remove the comma, everything should work fine. Using the comma in [,] turns a single column into a vector and therefore each element in the vector is factored individually. Whereas leaving it out keeps the column as a list, which is what you want to give to lapply in this situation. However, if you use drop=FALSE, you can leave the comma in, and the column will remain a list/data.frame.
No good:
df[,2] <- lapply(df[,2], factor)
# Warning message:
# In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
# provided 3 variables to replace 1 variables
Succeeds on a single column:
df[,2] <- lapply(df[,2,drop=FALSE], factor)
df[,2]
# [1] 20 20 23
# Levels: 20 23
In my opinion, the best way to subset data frame columns is without the comma. This also succeeds:
df[2] <- lapply(df[2], factor)
df[[2]]
# [1] 20 20 23
# Levels: 20 23
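To make this programmatic (the question's actual goal), the comma-free form can be wrapped in a small helper; cols here is a hypothetical argument naming the columns to convert, and it works equally for one column or many:

```r
convert_to_factor <- function(df, cols) {
  # df[cols] keeps a data.frame even when length(cols) == 1,
  # so lapply always sees whole columns rather than individual elements
  df[cols] <- lapply(df[cols], factor)
  df
}

csv_data <- data.frame(gender = c(1, 2, 2), country = c(20, 20, 23),
                       income = c(10000, 12000, 3000))
csv_data <- convert_to_factor(csv_data, 2)          # a single column index
csv_data <- convert_to_factor(csv_data, "gender")   # or column names
```

Because df[cols] accepts indices, names, or logical vectors, the set of columns to convert can be driven entirely by function arguments.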
