Problems subsetting columns based on values from two separate dataframes - r

I am using data obtained from a spatially gridded system, for example a city divided up into equally spaced squares (e.g. 250m2 cells). Each cell possesses a unique column and row number with corresponding numerical information about the area contained within this 250m2 square (say temperature for each cell across an entire city). Within the entire gridded section (or the example city), I have various study sites and I know where they are located (i.e. which cell row and column each site is located within). I have a dataframe containing information on all cells within the city, but I want to subset this to only contain information from the cells where my study sites are located. I previously asked a question on this 'Matching information from different dataframes and filtering out redundant columns'. Here is some example code again:
###Dataframe showing cell values for my own study sites
Site <- as.data.frame(c("Site.A","Site.B","Site.C"))
Row <- as.data.frame(c(1,2,3))
Column <- as.data.frame(c(5,4,3))
df1 <- cbind(Site,Row, Column)
colnames(df1) <- c("Site","Row","Column")
###Dataframe showing information from ALL cells
eg1 <- rbind(c(1,2,3,4,5),c(5,4,3,2,1)) ##Cell rows and columns
eg2 <- as.data.frame(matrix(sample(0:50, 15*10, replace=TRUE), ncol=5)) ##Numerical information
df2 <- rbind(eg1,eg2)
rownames(df2)[1:2] <- c("Row","Column")
From this, I used the answer to the previous question, which worked perfectly for the example data.
output <- df2[, (df2['Row', ] %in% df1$Row) & (df2['Column', ] %in% df1$Column)]
names(output) <- df1$Site[mapply(function(r, c){which(r == df1$Row & c == df1$Column)}, output[1,], output[2,])]
However, I cannot apply this to my own data and cannot figure out why.
EDIT: Initially, I thought there was a problem with naming the columns (i.e. the 'names' function). But it would appear there may be an issue with the 'output' line of code, whereby columns are being included from df2 that shouldn't be (i.e. the output contained columns from df2 which possessed column and row numbers not specified within df1).
I have also tried:
output <- df2[, (df2['Row', ] == df1$Row) & (df2['Column', ] == df1$Column)]
But when using my own (seemingly comparable) data, I don't get information from all cells specified in the 'df1' equivalent (although again works fine in the example data above). I can get my own data to work if I do each study site individually.
SiteA <- df2[, which((df2['Row', ] == 1) & (df2['Column', ] == 5))]
SiteB <- df2[, which((df2['Row', ] == 2) & (df2['Column', ] == 4))]
SiteC <- df2[, which((df2['Row', ] == 3) & (df2['Column', ] == 3))]
But I have 1000s of sites and was hoping for a more succinct way. I am sure that I have maintained the same structure, double checked spellings and variable names. Would anyone be able to shed any light on potential things which I could be doing wrong? Or failing this an alternative method?
Apologies for not providing an example code for the actual problem (I wish I could pinpoint what the specific problem is, but until then the original example is the best I can do)! Thank you.

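The only apparent issue I can see is that the mapply call is not wrapped in unlist. mapply returns a list whenever its results cannot be simplified to a vector (for example, when a column matches no site or more than one), and a list is not what you want for subsetting. So try: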
output <- df2[, (df2['Row', ] %in% df1$Row) & (df2['Column', ] %in% df1$Column)]
names(output) <- df1$Site[unlist(mapply(function(r, c){which(r == df1$Row & c == df1$Column)}, output[1,], output[2,]))]
Edit:
If the goal is to grab columns whose first 2 rows match the 2nd and 3rd elements of a given row in df1, you can try the following:
output_df <- Filter(function(x) !all(is.na(x)), data.frame(do.call(cbind, apply(df2, 2, function(x) {
  ## Create a condition vector for an if-statement or for subsetting
  condition <- paste0(x[1:2], collapse = "") == apply(df1[, c('Row', 'Column')], 1, function(y) {
    paste0(y, collapse = "")
  })
  ## Return a column if it meets the condition (first 2 rows are matched in df1)
  if (sum(condition) != 0) {
    tempdf <- data.frame(x)
    names(tempdf) <- df1[condition, ]$Site[1]
    tempdf
  } else {
    ## If they are not matched, then return an empty column
    data.frame(rep(NA, nrow(df2)))
  }
}))))
It is quite a condensed piece of code, so I hope the following explanation will help clarify some things:
This basically goes through every column in df2 (with apply(df2, 2, FUN)) and checks if its first 2 rows can be found in the 2nd and 3rd elements of every row in df1. If the condition is met, then it returns that column in a data.frame format with its column name being the value of Site in the matching row in df1; otherwise an empty column (with NA's) is returned. These columns are then bound together with do.call and cbind, and then coerced into a data.frame. Finally, we use the Filter function to remove columns whose values are NA's.
All that should give the following:
Site.A Site.B Site.C
1 2 3
5 4 3
40 42 33
13 47 25
23 0 34
2 41 17
10 29 38
43 27 8
31 1 25
31 40 31
34 12 43
43 30 46
46 49 25
45 7 17
2 13 38
28 12 12
16 19 15
39 28 30
41 24 30
10 20 42
11 4 8
33 40 41
34 26 48
2 29 13
38 0 27
38 34 13
30 29 28
47 2 49
22 10 49
45 37 30
29 31 4
25 24 31
I hope this helps.
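The edit in the question suggests that testing Row and Column separately with %in% lets through any column whose row number and column number each appear somewhere in df1, even when they do not appear together. A minimal sketch of matching on the pair instead is shown below. The names cell_key and site_key, and the small grid in which row and column numbers repeat, are assumptions made for illustration and are not part of the original data.

```r
# Study sites, as in the question
df1 <- data.frame(Site = c("Site.A", "Site.B", "Site.C"),
                  Row = c(1, 2, 3), Column = c(5, 4, 3),
                  stringsAsFactors = FALSE)

# Hypothetical grid in which row and column numbers repeat across cells
set.seed(1)
m <- rbind(c(1, 1, 2, 2, 3),                                  # cell rows
           c(5, 4, 4, 3, 3),                                  # cell columns
           matrix(sample(0:50, 20, replace = TRUE), ncol = 5)) # cell values
df2 <- as.data.frame(m)
rownames(df2) <- c("Row", "Column", paste0("obs", 1:4))

# Independent %in% tests accept every column here, including unwanted cells
loose <- (df2["Row", ] %in% df1$Row) & (df2["Column", ] %in% df1$Column)

# Build one key per cell and per site, then match the pairs
cell_key <- paste(unlist(df2["Row", ]), unlist(df2["Column", ]), sep = "_")
site_key <- paste(df1$Row, df1$Column, sep = "_")
idx <- match(site_key, cell_key)   # column of df2 for each site, NA if absent
found <- !is.na(idx)

output <- df2[, idx[found], drop = FALSE]
names(output) <- df1$Site[found]
```

Because match() returns one position per site, the columns come out in the order of df1 and the naming step needs no mapply.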

Related

drawing a value from a vector r

After removing the values from the vector from 1 to 100 I have the following vector:
w
[1] 2 5 13 23 24 39 41 47 48 51 52 58 61 62 70 71 72 90
I am now trying to draw values from this vector with the sample function
for(x in roznica)
{
if(licznik_2 != licznik_1 )
{
roznica_proces_2 <- sample(1:w, roznica)
} else {
roznica_proces_2 <- NA
}
}
I tried various combinations with the sample function.
If w is the name of the vector, then you would NOT use sample(1:w, ...). For one thing, 1:w doesn't really make sense, since the : operator expects its second argument to be a single number, while w holds 18 values. Depending on what roznica is (hopefully a single integer), you might use:
sample(w, roznica) # returns roznica values drawn at random, without replacement, from `w`
The other problem is that you are currently overwriting any values from prior iterations of the for loop. So you might want to use:
roznica_proces_2[[x]] <- sample(w, roznica)
You would of course need to have initialized roznica_proces_2, perhaps with:
roznica_proces_2 <- list()
Regarding your query in the comment :
I am only concerned with the sample function itself. Here is an example: w is [1] 31, and I want to draw 1 number from it (which should be 31) with proces_nr_2 <- sample(w, 1). What do I get? proces_nr_2 is [1] 26
That happens because when a vector has length 1, sample draws from 1 up to that number. It is explained in the help page ?sample:
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x
So if you have only 1 number to sample just return that number directly instead of passing it in sample.
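As a hedged sketch of one way to avoid the special case, you can sample positions rather than values, which is the idea behind the resample example on the ?sample help page. The helper name sample_from is hypothetical.

```r
# Hypothetical helper: draw from the elements of x, even when length(x) == 1
sample_from <- function(x, size) {
  x[sample.int(length(x), size)]
}

w <- 31
one <- sample_from(w, 1)      # draws from the single element, not from 1:31

w2 <- c(2, 5, 13, 23)
draw <- sample_from(w2, 2)    # two distinct elements of w2
```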

Replacing empty cells in a column with values from another column in R

I am trying to pull the cell values from the StudyID column into the empty cells of the SigmaID column, but I am running into an odd issue with the output.
This is how my data looks before running commands.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44
ST04238 0 50
LM04829 1 24 LM23921
ST91124 0 89
ST29001 0 55
I tried accomplishing this by writing the syntax in three ways, because I wasn't sure if there is a problem with the way the logic was set up. All three produce the same output.
df$SigmaID <- ifelse(test = df$SigmaID != "", yes = df$SigmaID, no = df$StudyID)
df$SigmaID <- ifelse(df$SigmaID == "", df$StudyID, df$SigmaID)
df %>% mutate(SigmaID = ifelse(Gender == 0, df$StudyID, df$SigmaID))
Output: instead of pulling the values from the StudyID column, it is populating one- to four-digit numbers.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44 5
ST04238 0 50 4908
LM04829 1 24 LM23921
ST91124 0 89 209
ST29001 0 55 4092
I have tried recoding the empty spaces to NA and then calling on NA in the logic, but this produced the same output as seen above. I'm wondering if it could have anything to do with variable type or variable attributes and something's off about how it's reading the characters in StudyID. Would appreciate feedback on this issue!
Here is how to do it:
df$SigmaID[df$SigmaID == ""] = df$StudyID[df$SigmaID == ""]
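df$SigmaID == "" is a logical vector that is TRUE for the rows where SigmaID is empty; indexing both columns with it copies StudyID into exactly those positions.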
I also recommend using data.table instead of data.frame. It is faster and has some useful syntax features:
library(data.table)
setDT(df) # setDT converts a data.frame to a data.table
df[SigmaID == "", SigmaID := StudyID]
Following up on this! As it turns out, R versions before 4.0.0 convert strings into factors by default when a data frame is created or read from a file. There are a few ways of addressing the issue above.
i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)
Another method:
df <- read.csv("/insert file pathway here", stringsAsFactors = FALSE)
This is what I found to be helpful! I'm sure there are additional methods of troubleshooting this as well.
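To illustrate why factors produce numbers here, the following is a small sketch on made-up rows modelled on the question's data. Setting stringsAsFactors = TRUE is an assumption used to reproduce the older default on any R version.

```r
# Toy data; stringsAsFactors = TRUE mimics the default before R 4.0.0
df <- data.frame(
  StudyID = c("LM24008", "ST04283", "LM04829"),
  SigmaID = c("LM24008", "", "LM23921"),
  stringsAsFactors = TRUE
)

# ifelse() drops the factor attributes, so only the integer codes survive
codes <- ifelse(df$SigmaID == "", df$StudyID, df$SigmaID)

# Convert factor columns to character, then the same ifelse() behaves
is_fct <- sapply(df, is.factor)
df[is_fct] <- lapply(df[is_fct], as.character)
df$SigmaID <- ifelse(df$SigmaID == "", df$StudyID, df$SigmaID)
```

After the conversion, SigmaID holds the StudyID string in the row that was empty and is unchanged elsewhere.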

How to select columns in dataframe using variables

I have a dataframe (data in testdf dataframe included for replication) with 41 columns. I want to rearrange columns therein in a specific manner. This is because there can be variable number of columns in the original dataset due to the variable number of lags of variables used in the time series regression. The original dataset is then manipulated to include as many columns as required to make the total number of lags of the variables to be the same (7 in this case). Hence, in this dataset, there are 6 usage lags variables and 1 has been added (appears at the end).
The dataframe is shown below :-
region,regionalentity,entity,entitycode,dateperiod,dateperiodmax,avgtemp,avgtempcategory,adjavgtemp,adjavgtemptype,dayname,daynum,usage.lag0,intercept,usage.lag-1,usage.lag-2,usage.lag-3,usage.lag-4,usage.lag-5,usage.lag-6,temp.lag0,ciresL1,modelcomputedfittedvalues,usagelevel_L0,usagelevel_L1,fittedlevelusage,fittedlevelusagevar0,fittedlevelusagevarpct,fittedlevelusageabsvar0,fittedlevelusagevarabspct,adjfittedlevelusage,adjfittedlevelusagevar0,adjfittedlevelusagevarpct,adjfittedlevelusageabsvar0,adjfittedlevelusagevarabspct,adjfac,minvarpct,maxvarpct,adjminvarpct,adjmaxvarpct,usage_lag_7
VIC,VIC,VIC_TRU,6,2018-08-08,2019-08-26,12.8,Actual,12.8,Plus_0,Wed,3,-1978.630477847,1,5217.164445177,38381.403272784,-26573.993165182,-3571.086713581,-48.301188955,-865.165969976,1.75,37767.5832575731,-10546.6865414192,154724.804766449,156703.435244296,146156.748702877,8568.05606357215,5.5376098722537,8568.05606357215,5.5376098722537,146156.748702877,8568.05606357215,5.5376098722537,8568.05606357215,5.5376098722537,1,5,10,5,10,9999
VIC,VIC,VIC_TRU,6,2018-08-09,2019-08-26,10.5,Actual,10.5,Plus_0,Thu,4,6219.068623674,1,-1978.630477847,5217.164445177,38381.403272784,-26573.993165182,-3571.086713581,-48.301188955,-2.3,49140.1804479394,-4574.58388500458,160943.873390123,154724.804766449,150150.220881444,10793.6525086786,6.70646995211497,10793.6525086786,6.70646995211497,150150.220881444,10793.6525086786,6.70646995211497,10793.6525086786,6.70646995211497,1,5,10,5,10,9999
VIC,VIC,VIC_TRU,6,2018-08-10,2019-08-26,14.7,Actual,14.7,Plus_0,Fri,5,-47279.361890857,1,6219.068623674,-1978.630477847,5217.164445177,38381.403272784,-26573.993165182,-3571.086713581,4.2,37811.9212791045,-15456.0153096346,113664.511499266,160943.873390123,145487.858080488,-31823.3465812224,-27.9976099500748,31823.3465812224,27.9976099500748,145487.858080488,-31823.3465812224,-27.9976099500748,31823.3465812224,27.9976099500748,1,25,30,25,30,9999
VIC,VIC,VIC_TRU,6,2018-08-11,2019-08-26,11.4,Actual,11.4,Plus_0,Sat,6,34609.477278232,1,-47279.361890857,6219.068623674,-1978.630477847,5217.164445177,38381.403272784,-26573.993165182,-3.3,22575.5057919593,5349.94126323161,148273.988777498,113664.511499266,119014.452762498,29259.5360150004,19.7334247606353,29259.5360150004,19.7334247606353,119014.452762498,29259.5360150004,19.7334247606353,29259.5360150004,19.7334247606353,1,15,20,15,20,9999
VIC,VIC,VIC_TRU,6,2018-08-12,2019-08-26,10.1,Actual,10.1,Plus_0,Sun,7,181.193161194,1,34609.477278232,-47279.361890857,6219.068623674,-1978.630477847,5217.164445177,38381.403272784,-1.3,32008.3823244177,601.787165323653,148455.181938692,148273.988777498,148875.775942822,-420.594004129642,-0.283313791163811,420.594004129642,0.283313791163811,148875.775942822,-420.594004129642,-0.283313791163811,420.594004129642,0.283313791163811,1,0,5,0,5,9999
VIC,VIC,VIC_TRU,6,2018-08-13,2019-08-26,11.4,Actual,11.4,Plus_0,Mon,1,-11354.297567614,1,181.193161194,34609.477278232,-47279.361890857,6219.068623674,-1978.630477847,5217.164445177,1.3,22271.5206463676,-4603.22595618364,137100.884371078,148455.181938692,143851.955982508,-6751.07161143035,-4.92416343074627,6751.07161143035,4.92416343074627,143851.955982508,-6751.07161143035,-4.92416343074627,6751.07161143035,4.92416343074627,1,0,5,0,5,9999
VIC,VIC,VIC_TRU,6,2018-08-14,2019-08-26,13.05,Actual,13.05,Plus_0,Tue,2,-17233.144436292,1,-11354.297567614,181.193161194,34609.477278232,-47279.361890857,6219.068623674,-1978.630477847,1.65,20835.2779179977,-6148.02463599273,119867.739934786,137100.884371078,130952.859735085,-11085.1198002992,-9.24779244718395,11085.1198002992,9.24779244718395,130952.859735085,-11085.1198002992,-9.24779244718395,11085.1198002992,9.24779244718395,1,5,10,5,10,9999
VIC,VIC,VIC_TRU,6,2018-08-15,2019-08-26,14.95,Actual,14.95,Plus_0,Wed,3,12026.579924003,1,-17233.144436292,-11354.297567614,181.193161194,34609.477278232,-47279.361890857,6219.068623674,1.9,16190.4338545925,-1828.25322565582,131894.319858789,119867.739934786,118039.48670913,13854.8331496588,10.5044956935919,13854.8331496588,10.5044956935919,118039.48670913,13854.8331496588,10.5044956935919,13854.8331496588,10.5044956935919,1,10,15,10,15,9999
VIC,VIC,VIC_TRU,6,2018-08-16,2019-08-26,11.7,Actual,11.7,Plus_0,Thu,4,12449.922399102,1,12026.579924003,-17233.144436292,-11354.297567614,181.193161194,34609.477278232,-47279.361890857,-3.25,42712.6323897985,4460.21257151099,144344.242257891,131894.319858789,136354.5324303,7989.709827591,5.53517736669833,7989.709827591,5.53517736669833,136354.5324303,7989.709827591,5.53517736669833,7989.709827591,5.53517736669833,1,5,10,5,10,9999
VIC,VIC,VIC_TRU,6,2018-08-17,2019-08-26,11.8,Actual,11.8,Plus_0,Fri,5,-9762.010530065,1,12449.922399102,12026.579924003,-17233.144436292,-11354.297567614,181.193161194,34609.477278232,0.1,30367.4176907901,-4864.57340900852,134582.231727826,144344.242257891,139479.668848882,-4897.43712105646,-3.63899235298821,4897.43712105646,3.63899235298821,139479.668848882,-4897.43712105646,-3.63899235298821,4897.43712105646,3.63899235298821,1,0,5,0,5,9999
VIC,VIC,VIC_TRU,6,2018-08-18,2019-08-26,11.35,Actual,11.35,Plus_0,Sat,6,22305.952959846,1,-9762.010530065,12449.922399102,12026.579924003,-17233.144436292,-11354.297567614,181.193161194,-0.45,21368.3344560516,-1971.88265162283,156888.184687672,134582.231727826,132610.349076203,24277.8356114689,15.4746105704521,24277.8356114689,15.4746105704521,132610.349076203,24277.8356114689,15.4746105704521,24277.8356114689,15.4746105704521,1,15,20,15,20,9999
VIC,VIC,VIC_TRU,6,2018-08-19,2019-08-26,9.3,Actual,9.3,Plus_0,Sun,7,27244.1359885,1,22305.952959846,-9762.010530065,12449.922399102,12026.579924003,-17233.144436292,-11354.297567614,-2.05,40241.1145869285,-1906.75491557666,184132.320676172,156888.184687672,154981.429772095,29150.8909040767,15.8314905265021,29150.8909040767,15.8314905265021,154981.429772095,29150.8909040767,15.8314905265021,29150.8909040767,15.8314905265021,1,15,20,15,20,9999
VIC,VIC,VIC_TRU,6,2018-08-20,2019-08-26,8.95,Actual,8.95,Plus_0,Mon,1,-6677.68231343,1,27244.1359885,22305.952959846,-9762.010530065,12449.922399102,12026.579924003,-17233.144436292,-0.35,51845.2410212359,-11603.7793657006,177454.638362742,184132.320676172,172528.541310471,4926.0970522706,2.77597536909741,4926.0970522706,2.77597536909741,172528.541310471,4926.0970522706,2.77597536909741,4926.0970522706,2.77597536909741,1,0,5,0,5,9999
VIC,VIC,VIC_TRU,6,2018-08-21,2019-08-26,10.5,Actual,10.5,Plus_0,Tue,2,-14638.358924711,1,-6677.68231343,27244.1359885,22305.952959846,-9762.010530065,12449.922399102,12026.579924003,1.55,42497.3131741632,-16306.3388719971,162816.279438031,177454.638362742,161148.299490745,1667.97994728605,1.02445526518796,1667.97994728605,1.02445526518796,161148.299490745,1667.97994728605,1.02445526518796,1667.97994728605,1.02445526518796,1,0,5,0,5,9999
Therefore, the variables are defined as follows :-
maxusagelag <- 6 (this number can change depending upon the number of lags in the time series regression model selected)
maxpossibleusagelags <- 7
I want the data frame to be arranged like so :-
1. The first 14 columns as they are
2. The next 6 columns as they are - as 'maxusagelag' = 6 (there are 6 lagged variables of usage in the original data set)
3. Then the last variable (named "usage_lag_7") - 1 column in this case because (maxpossibleusagelags - maxusagelag = 1)
4. Then all of the remaining columns in the dataset excluding the last as it has already been moved to a different position in step 3 above
I tried a whole lot of options that I could think of but nothing worked. Here are some of the things that I tried :-
val1 <- ((ncol(testdf) - (maxpossibleusagelags - maxusagelag) + 1):ncol(testdf))
val1 : 41
val2 <- ((15 + maxpossibleusagelags - 1):(ncol(testdf) - (maxpossibleusagelags - maxusagelag)))
val2 : 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
val3 <- ((ncol(testdf) - (maxpossibleusagelags - maxusagelag)))
val3 : 40
paste0(val2, ":", val3)
[1] "21:40"
testdf1 <- testdf[, paste0("c(", val2, ":", val3, ")")]
testdf1 <- dplyr::select(testdf, paste0("c(", val2, ":", val3, ")")])
Error: unexpected ']' in "testdf1 <- dplyr::select(testdf, paste0("c(", val2, ":", val3, ")")]"
Is there something that I can do to select the various columns in a dataframe by position using variable names?
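We can build a regex from maxusagelag, use grep to find the lag columns dynamically, and use setdiff to append the remaining columns at the end. (This assumes the data were read with read.csv's default check.names = TRUE, which turns a header such as usage.lag-1 into usage.lag.1.)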
first_set <- 1:14
second_set <- grep(paste0("lag\\.[1-",maxusagelag, "]"), names(testdf))
combine_set <- c(first_set, second_set, ncol(testdf))
and now use it to subset columns
testdf[,c(combine_set, setdiff(seq_along(testdf), combine_set))]
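If you would rather work purely by position, as the question asks, the same reordering can be sketched with index vectors computed from the variables. The toy data frame and the names n_lead, lag_idx and pad_idx below are assumptions for illustration; with the real data n_lead would be 14.

```r
# Toy frame: 3 leading columns, 2 lag columns, 2 other columns, 1 padded lag at the end
toy <- data.frame(id = 1, day = 2, intercept = 3,
                  lag.1 = 4, lag.2 = 5,
                  fit = 6, resid = 7,
                  lag_3 = 8)

n_lead <- 3                 # columns kept in place at the front
maxusagelag <- 2            # lag columns present in the original data
maxpossibleusagelags <- 3   # lags after padding

lead_idx <- seq_len(n_lead)
lag_idx <- n_lead + seq_len(maxusagelag)
# Padded lag columns sit at the very end of the data frame
pad_idx <- ncol(toy) - rev(seq_len(maxpossibleusagelags - maxusagelag)) + 1

front <- c(lead_idx, lag_idx, pad_idx)
reordered <- toy[, c(front, setdiff(seq_along(toy), front))]
```

Using seq_len() rather than 1:n keeps the index vectors empty, instead of counting backwards, when no padded columns were added.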

Nested for loops, different in R

d3:
Col1 Col2
PBR569 23
PBR565 22
PBR565 22
PBR565 22
I am using this loop:
for ( i in 1:(nrow (d3)-1) ){
for (j in (i+1):nrow(d3)) {
if(c(i) == c(j)) {
print(c(j))
# d4 <- subset.data.frame(c(j))
}
}
}
I want to compare all the rows in Col1 and eliminate the ones that are not the same. Then I want to output a data frame with only the ones that have the same values in col1.
Expected Output:
Col1 Col2
PBR565 22
PBR565 22
PBR565 22
Not sure what's up with my nested loop. Is it because I don't specify the column names?
The OP has requested to compare all the rows in Col1 and eliminate the ones that are not the same.
If I understand correctly, the OP wants to remove all rows where the value in Col1 appears only once and to keep only those rows where the value appears two or more times.
This can be accomplished by finding duplicated values in Col1. The duplicated() function marks the second and subsequent appearances of a value as duplicated. Therefore, we need to scan forward and backward and combine both results:
d3[duplicated(d3$Col1) | duplicated(d3$Col1, fromLast = TRUE), ]
Col1 Col2
2 PBR565 22
3 PBR565 22
4 PBR565 22
The same can be achieved by counting the appearances using the table() function as suggested by Ryan. Here, the counts are filtered to keep only those entries which appear two or more times.
t <- table(d3$Col1)
d3[d3$Col1 %in% names(t)[t >= 2], ]
Please, note that this is different from Ryan's solution which keeps only the rows whose value appears most often. Only one value is picked, even in case of ties. (For the given small sample dataset both approaches return the same result.)
Ryan's answer can be re-written in a slightly more concise way
d3[d3$Col1 == names(which.max(t)), ]
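The difference between the two approaches only shows up when there are ties, so here is a small sketch on made-up data (the data frame d and the names counts, keep_dups and keep_top are hypothetical):

```r
# Made-up data with a tie: "A" and "B" both appear twice, "C" once
d <- data.frame(Col1 = c("A", "A", "B", "B", "C"), Col2 = 1:5)
counts <- table(d$Col1)

# Keep every value that appears two or more times
keep_dups <- d[d$Col1 %in% names(counts)[counts >= 2], ]

# Keep only the single most frequent value; which.max() picks the first on ties
keep_top <- d[d$Col1 == names(which.max(counts)), ]
```

Here keep_dups retains the four "A" and "B" rows, while keep_top retains only the two "A" rows.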
Data
d3 <- data.table::fread(
"Col1 Col2
PBR569 23
PBR565 22
PBR565 22
PBR565 22", data.table = FALSE)

Remove All Columns where the last row is not equal to specific value x [duplicate]

I have a data frame(DF) that is like so:
DF <- rbind (c(10,20,30,40,50), c(21,68,45,33,21), c(11,98,32,10,30), c(50,70,70,70,50))
10 20 30 40 50
21 68 45 33 21
11 98 32 10 30
50 70 70 70 50
In my scenario my x would be 50. So my resulting dataframe(resultDF) will look like this:
10 50
21 21
11 30
50 50
How Can I do this in r? I have attempted using subset as below but it doesn't seem to work as I am expecting:
resultDF <- subset(DF, DF[nrow(DF),] == 50)
Error in x[subset & !is.na(subset), vars, drop = drop] :
(subscript) logical subscript too long
I have solved it. My subsetting was incorrect. I used the following piece of code to get the results I needed.
resultDF <- DF[, DF[nrow(DF),] == 50]
Your issue with subset() was only about the syntax for calling it with a logical column vector (its third arg, not its second). You can either use subset() or plain logical indexing. The latter is recommended.
The help page ?subset tells you its optional second arg ('subset') is a logical row-vector, and its optional third arg ('select') is a logical column-vector:
subset: logical expression indicating elements or rows to keep:
missing values are taken as false.
select: expression, indicating columns to select from a data frame.
So you want to call it with this logical column-vector:
> DF[nrow(DF),] == 50
[1]  TRUE FALSE FALSE FALSE  TRUE
There are two syntactical ways to leave subset()'s second arg default and pass the third arg:
# Explicitly pass the third arg by name...
> subset(DF, select=(DF[nrow(DF),] == 50) )
# Leave the 2nd arg empty; it is treated as missing, so all rows are kept...
> subset(DF, , (DF[nrow(DF),] == 50) )
[,1] [,2]
[1,] 10 50
[2,] 21 21
[3,] 11 30
[4,] 50 50
The second way is probably preferable as it looks like generic row,col-indexing, and also doesn't require you to know the third arg's name.
(As a mnemonic, in R and SQL terminology, 'select' implicitly means column indices, and 'filter'/'subset' implicitly means row indices. In data.table terminology these are the j and i indices, respectively.)
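As a quick check that the two routes agree, here is a minimal sketch using the DF from the question. The variable names x, keep, by_index and by_subset are assumptions for illustration; drop = FALSE is added so the result stays a matrix even if only one column matches.

```r
DF <- rbind(c(10, 20, 30, 40, 50),
            c(21, 68, 45, 33, 21),
            c(11, 98, 32, 10, 30),
            c(50, 70, 70, 70, 50))
x <- 50

# Logical column selector: TRUE where the last row equals x
keep <- DF[nrow(DF), ] == x

by_index  <- DF[, keep, drop = FALSE]   # plain logical indexing
by_subset <- subset(DF, select = keep)  # same selection through subset()
```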
