R descale data back to their original values - r

I have the following reproducible data:
MyScaledData contains scaled values between 0 and 1 for 6 variables. minvec and maxvec are named vectors and contain the maximum and minimum values from the original data set that was used to create the scaled data frame MyScaledData. minvec and maxvec contain values for all 22 variables of the original data set, including the 6 variables I have now in MyScaledData.
X14863 X15066 X15067 X15068 X15069 X15070
0.6014784 0.6975109 0.5043208 0.15284648 0.9416364 0.7860731
0.2495215 0.7801444 0.6683925 0.13768245 0.4277954 0.2058412
0.6167705 0.3344044 0.9254125 0.12777565 0.3826231 0.2590457
0.1227380 0.4448501 0.3961802 0.19117246 0.7789835 0.7587897
0.7299760 0.6375931 0.5760061 0.44746838 0.3634903 0.1079679
0.1988647 0.7814712 0.6572054 0.71409305 0.6715690 0.4029459
0.5041371 0.6374958 0.9333635 0.89057831 0.5716711 0.7219823
0.5774327 0.7677038 0.7622717 0.45288270 0.2817869 0.2572325
0.6809509 0.6089656 0.8191862 0.01151454 0.2780449 0.4655353
0.5754383 0.5662045 0.7003630 0.62559642 0.2865510 0.1847980
MyScaledData<-structure(list(X14863=c(0.601478444979532,0.249521497274968,0.616770466379489,0.122737966507165,0.729975993009922,0.198864661389536,0.504137054265617,0.577432671357089,0.680950947164095,0.575438259547452),X15066=c(0.697510926657699,0.780144354632397,0.334404422875259,0.444850091405716,0.637593061483412,0.781471212351781,0.637495834667556,0.7677038048039,0.608965550162107,0.566204459603197),X15067=c(0.50432083998529,0.668392530333367,0.925412484830622,0.396180214305286,0.576006062451239,0.657205387087382,0.933363470346907,0.762271729415789,0.819186151914183,0.700362991098644),X15068=c(0.152846483002917,0.137682446305942,0.127775652495726,0.191172455317975,0.447468375530484,0.714093046059637,0.890578310935752,0.452882699805154,0.011514536383708,0.625596417031532),X15069=c(0.94163636689763,0.427795395079331,0.38262308941233,0.77898345642139,0.363490265569212,0.671568951210917,0.571671115989958,0.281786881885636,0.278044876559552,0.286551022600823),X15070=c(0.786073059382553,0.205841229942702,0.259045736299276,
0.758789694211416,0.107967864736275,0.402945912782515,0.721982268066207,0.257232456508833,0.46553533255268,0.184798001614338)),row.names=c(NA,10L),class="data.frame"); minvec<-c(X14861=22.95,X14862=29.95,X14863= 39.95,X15066=59.95,X15067=79.95,X15068=14.99,X15069=24.99,X15070=33.45,X15071=36.95,X15072=44.95,X15073=54.95,X15074=74.95,X15132=12.95,X15548=12.95,X15549=22.95,X15550=29.95,X15551=39.95,X15552=59.95,X15553=79.95,X15956=49.95,X15957=49.95,X16364=3.5);maxvec<-c(X14861=29.99,X14862=39.99,X14863=49.99,X15066=79.99,X15067=99.99,X15068=19.99,X15069=29.99,X15070=39.99,X15071=49.99,X15072=59.99,X15073=79.99,X15074=99.99,X15132=19.99,X15548=19.99,X15549=29.99,X15550=39.99,X15551=49.99,X15552=79.99,X15553=99.99,X15956=59.99,X15957=59.99,X16364=9.99)
I want to rescale back MyScaledData to their original scale by matching the min/max values to each corresponding column based on name. I've tried the following:
descale <- function(x,minval,maxval) {x*(maxval-minval) + minval}
as.data.frame(Map(descale,MyScaledData,minvec,maxvec))
The output I get has more than 6 columns than MyScaledData has. I sense that the function is not even matching columns by names and therefore the output is not calculated correctly. How can I match the function by column name so it takes the corresponding minvec and maxvec element for each column and return only the 6 columns I have?
Desired output shall be:
MyDeScaledData <- structure(list(X14863 = c(45.9888435875945, 42.4551958326407,46.1423754824501, 41.1822891837319, 47.2789589698196, 41.9466012003509,45.0115360248268, 45.7474240204252, 46.7867475095275, 45.7274001258564), X15066 = c(73.9281189702203, 75.5840928668332, 66.6514646344202,68.8647958317706, 72.7273649521276, 75.6106830955297, 72.7254165267378,75.3347842482702, 72.1536696252486, 71.2967373704481), X15067 = c(90.0565896333052,93.3445863078807, 98.4952661960057, 87.8894514946779, 91.4931614915228,93.1203959572311, 98.654603945752, 95.2259254574924, 96.3664904843602,93.9852743416168), X15068 = c(15.7542324150146, 15.6784122315297,15.6288782624786, 15.9458622765899, 17.2273418776524, 18.5604652302982,19.4428915546788, 17.2544134990258, 15.0475726819185, 18.1179820851577), X15069 = c(29.6981818344881, 27.1289769753967, 26.9031154470616,28.8849172821069, 26.8074513278461, 28.3478447560546, 27.8483555799498,26.3989344094282, 26.3802243827978, 26.4227551130041), X15070 = c(38.5909178083619,34.7962016438253, 35.1441591153973, 38.4124846001427, 34.1561098353752,36.0852662695976, 38.171764033153, 35.1323002655678, 36.4946010748945,34.6585789305578)), row.names = c(NA, 10L), class = "data.frame")

Thanks to #Shirin Yavari for providing the solution:
MyDeScaledData<-as.data.frame(Map(descale,MyScaledData,minvec[names(MyScaledData)],maxvec[names(MyScaledData)]))

Related

How can I remove all rows in which one of the values satisfies a condition? apply() not working

I have a dataframe with two columns, and I want to remove all rows in which one of the values in each row is either smaller than 0 or bigger than a specified number (for the sake of argument let's call this 2000).
This is the dataframe
structure(list(xx = c(134.697838289433, 222.004361198059, 131.230956160172,
206.658871436917, 111.25078650042, 241.965831417648, 171.46912254679,
116.860666678254, 196.894985820028, 135.309699618638, 133.082437475133,
185.509376072318, 718.998297748551, 745.902984215293, 752.655615982603,
633.199684348903, 764.983924278636, 694.856525559398, 773.56532078895,
757.32358575657, 709.924023536199, 658.863564702233, 733.076690816291,
745.9306541374, 788.134444412421, 759.445624288787, 796.989170170713,
632.952543475636, 746.103571612919, 715.296116988119, 766.899107551248,
628.268453830605, 658.574104878488, 689.916530654021, 820.841422812349,
709.097957368612, 793.109262845978, 716.713801941779, 726.83260343463,
746.547080776193, 759.644057119419, 757.41275593749, 723.539527360327,
839.816318612061, 795.655016954661, 766.245386324182, 756.300015395758,
808.255074043333, 745.915083305187, 685.465492956583, 694.567959198318,
786.919467838804, 699.521900871042, 749.041223560884, 700.079697765533,
753.805501259023, 745.080253997501, 846.982894686656, 775.66384433188,
809.39649823454, 841.009469183585, 790.987061753069, 792.441925234251,
1377.97739642236, 1353.19738061511, 1259.94435540633, 1276.25060187203,
1331.26106031956, 1227.68481147557, 1345.95561236514, 1309.51489973952,
1285.62680259649, 1329.46388049714, 1256.00394500077, 1294.0505313591,
1349.09440181876, 1294.72661682462, 1339.38577920408, 1277.114896541,
1267.54884404031, 1291.32793111573, 1254.85565551553, 1298.78499697743,
1283.89664572036, 1273.92831816666, 1310.221891323, 1327.89682404014,
1310.81394400863, 595.342571560588, 689.892254230306, 562.390766853428,
736.319251501976, 609.577261412134, 641.591997384705, 682.957658696869,
580.320759093636, 560.64984978551, 643.487033739876, 688.457314818318,
631.156743281308, 659.535909106305), yy = c(1169.70954243065,
1259.830208937, 1172.21661417439, 1097.62724268622, 1198.15024522658,
1231.90665701131, 1211.36196331211, 1152.4207367321, 1287.57553021171,
1120.61366993258, 1234.70366243878, 1258.47454705197, 893.983957068268,
994.99854601335, 916.330965835536, 947.536265806389, 950.345051732045,
934.313361799171, 1018.76942964176, 918.182358835366, 1005.51128858608,
967.577307930044, 997.239384198691, 995.866808447868, 962.292293255127,
864.624084608006, 895.091604672023, 906.22162647536, 1024.45206885923,
908.693026118345, 923.625774785301, 931.801569764776, 1007.88553380827,
848.55309782664, 927.608364899483, 1024.60765786828, 1085.64295260059,
1057.90632135992, 1195.30607038065, 1151.39888340311, 1168.2831257626,
1137.15375447446, 1145.42393212912, 1108.89072769468, 1075.15451622384,
1129.91711324634, 1191.94330388541, 1132.41649984784, 1210.89342724886,
1100.60339252755, 1083.5987922884, 1056.69487941162, 1150.2707936581,
1055.75678264632, 1055.53323667429, 1049.79655119467, 1166.86598024805,
1141.82593378866, 1066.37755267981, 1160.55793904653, 1162.65728735716,
1060.29360609309, 1107.40480300404, 1825.01445883899, 1802.95011068891,
1692.84948509132, 1675.97166713074, 1758.10341887143, 1788.48414279738,
1680.15824054313, 1756.01930833023, 1706.98458587119, 1770.57687329296,
1692.21991398915, 1835.60585163662, 1790.6487914694, 1787.52076839767,
1704.25313427813, 1735.96312434652, 1813.02044772293, 1847.21159474717,
1725.63580525853, 1841.32016678, 1713.80845602987, 1770.39756152819,
1747.72988313376, 1778.13110060636, 1786.3871288087, 6.01666671271317,
19.2497357431764, 9.6964112500295, -3.23929433528044, 89.4863211231715,
86.0082947221296, 42.7982120490919, 2.19886414532234, 12.8780844043502,
30.694893442471, 7.58386594976601, 83.8385161493349, 36.4551491976192
)), row.names = 100:200, class = "data.frame")
First I create a function to eliminate points which satisfy the conditions.
routliers<-function(x){
if(x>2000|x<0){
rm(x)
}
}
Then I use the apply function across the rows to eliminate the points using the above function (the above dput() is named cds).
cds<-data.frame(apply(cds,1,routliers))
But this eliminates all points
length(cds)
[1]0
Interestingly, if I replace the rm() function with print(), then I do print out the desired points when using the apply function, but I receive the error "arguments imply differing number of rows: 0, 2". Also, I am not sure when I use the apply() function that the specified function applies to both columns of data, as I am not seeing any data points in print() that satisfies the condition for ONLY the second column of points. The first column is x co-ordinates, and the second column is y co-ordinates. I think the error "arguments imply differing number of rows:0,2" suggests that only the first value in the row is being tested against the function.
How can I write code in which rows are eliminated if one or more of the data points satisfies my condition?
This is easy to do when the columns are separate vectors, (x<-x[!condition]) however I cannot add them together again easily so I prefer to do this on a dataframe of points.
Please check if this code works for you, with df being the data you shared:
#Code
new <- df[!rowSums(df < 0 | df>2000) > 0, ]
Or this:
#Code 2
new <- df[which(apply(df,1,function(x) sum(x<0 | x>2000))==0),]
Let's make your function return TRUE for an outlier and FALSE for not an outlier. And it can be vectorized:
is_outlier = function(x) {
x > 2000 | x < 0
}
Here's how we'd use this to drop rows with outliers in a single column:
cds[!is_outlier(cds$xx), ]
For two columns, we can combine the is_outlier results with & or |. I can't tell from your text whether you want to remove rows where xx AND yy are outliers, or remove rows where xx OR yy are outliers. So pick the appropriate version:
cds[!is_outlier(cds$xx) & !is_outlier(cds$yy), ]
cds[!is_outlier(cds$xx) | !is_outlier(cds$yy), ]

Labelling Variables

I have a series of variables that fall under one related question: lets say there are 20 such variables in my dataframe, each one corresponds to an option on a MC question. They are titled popn1, popn2......popn20.
I want to label each variable by its option, as an example: (popn1 = Everyone; popn2=Children)
I'm using the labelVector package.
Is there a way I can do it without writing out each variable name? Ex. is there a paste function I can use, such as
df2 <- Set_label(df1,
(paste0(popn, 1:20) = "Everyone", "Children", .... "Youth"?)
This can be done in base R quite easily. Here's some sample data (using columns instead of 20, to make it easier to view)
popn1 popn2 popn3 popn4 popn5
1 -0.4085141 3.240716 2.730837 6.428722 8.015210
2 3.1378943 2.512700 2.021546 3.333371 5.654401
3 2.4073278 1.475619 2.449742 2.817447 6.295569
It looks like you already have your new column names in a character vector:
your_column_names <- c("Everyone", "Youth", "Someone", "Something", "Somewhere")
Then you just use the setNames argument on the column names for your data:
colnames(data) <- setNames(your_column_names, colnames(data))
Everyone Youth Someone Something Somewhere
1 -0.4085141 3.240716 2.730837 6.428722 8.015210
2 3.1378943 2.512700 2.021546 3.333371 5.654401
3 2.4073278 1.475619 2.449742 2.817447 6.295569
Sample Data:
data <- structure(list(popn1 = c(-0.408514139489243, 3.13789432899688,
2.40732780606037), popn2 = c(3.24071608151551, 2.51269963339946,
1.47561933493116), popn3 = c(2.73083728435832, 2.02154567048998,
2.44974180329751), popn4 = c(6.42872215439841, 3.3333709733048,
2.81744655980154), popn5 = c(8.0152099281755, 5.65440141443164,
6.29556905855252)), class = "data.frame", row.names = c(NA, -3L
))

Applying a 3-D function over a column in R

I am trying to apply a function that returns a list over a column. And I can't get around the "cannot take a sample larger than the population when 'replace = False'" error. I am trying to average my data in a column using lineMeanVariance and store the result in a new column of the data frame. I have tried using as.matrix instead of list, but still get the same error.
aveK <- function(k){
list(lineMeanVariance(k, numSeeds=5, numSteps=100)$mean)
}
amsData$k1.ave<- lapply(amsData$k1, aveK)
lineMeanVariance is a function from the larger code package I am working with. "An iterative algorithm for computing the Frechet mean"
##param - a list of lines
##return - A list consisting of $variance (a real number), $mean (a
#line), $error (an integer), and $minEigenvalue(a real number)
lineMeanVariance <- function(us, numSeeds=5, numSteps=100){
I am trying to get $mean which is a list: double[3].
Here is a portion of my dput(head(amsData)
structure(list(specimen = c("FL14-01_1a", "FL14-01_2a", "FL14-01_3a",
"FL14-01_4a", "FL14-01_5a", "FL14-01_5b"),
k1 =
list(c(-0.41060908075407, 0.772243365085709, -0.484809620246337),
c(-0.150521635394175, 0.651980832267664, -0.743144825477394),
c(-0.688516184675898, 0.335811781021467, -0.642787609686539),
c(-0.804101914050799, 0.156301578631426, -0.573576436351046),
c(-0.828532545265687, -0.0289329940296337, -0.559192903470747),
c(-0.637514091418047, -0.463181099617909, -0.615661475325658)),
k2 =
list(c(0.714768314528387, -0.0575182499604218, -0.696992042614361),
c(0.961496547169293, 0.271378151982945, 0.0433392248182801),
c(0.637691066683676, 0.702459457296904, -0.316070900789642),
c(0.547534946649107, 0.570538417729676, -0.612120409798986),
c(-0.424748642510372, 0.683208124163481, 0.593982533213405),
c(-0.59310308550509, 0.805082008191702, 0.0084669977181137)),
k3 =
list(c(-0.567454936802625, -0.630222554414604, -0.529919264233205),
c(0.229644380354319, -0.7067727288213, -0.669130606358858),
c(0.348742855430257, -0.629148765505902, -0.694658370458997),
c(0.231168937750018, -0.806181892476771, -0.544639035015027),
c(0.371887245949431, 0.72986981576373, -0.573576436351046),
c(0.491689116363176, 0.370514325026935, -0.788010753606722)),
site = c("FL14-01", "FL14-01", "FL14-01",
"FL14-01", "FL14-01", "FL14-01")), row.names = c(NA, 6L), class =
"data.frame")

How to prevent coercion to list in R

I am trying to remove all NA values from two columns in a matrix and make sure that neither column has a value that the other doesn't.
code:
data <- dget(file)
dependent <- data[,"chroma"]
independent <- data[,"mass..Pantheria."]
names(independent) <- names(dependent) <- rownames(data)
for (name in rownames(data)) {
if(is.na(dependent[name])) {
independent$name <- NULL
dependent$name <- NULL
}
if(is.na(independent[name])) {
independent$name <- NULL
dependent$name <- NULL
}
}
print(dput(independent))
print(dput(dependent))
I am brand new to R and am trying to perform this task with a for loop. However, when I delete a section by assigning NULL I receive the following warning:
1: In independent$Aeretes_melanopterus <- NULL : Coercing LHS to a list
2: In dependent$name <- NULL : Coercing LHS to a list
No elements are deleted and independent and dependent retain all their original rows.
file (input):
structure(list(chroma = c(7.443501276, 10.96156313, 13.2987235,
17.58110922, 13.4991105), mass..Pantheria. = c(NA, 126.57, NA,
160.42, 250.57)), .Names = c("chroma", "mass..Pantheria."), class = "data.frame", row.names = c("Aeretes_melanopterus",
"Ammospermophilus_harrisii", "Ammospermophilus_insularis", "Ammospermophilus_nelsoni",
"Atlantoxerus_getulus"))
chroma mass..Pantheria.
Aeretes_melanopterus 7.443501 NA
Ammospermophilus_harrisii 10.961563 126.57
Ammospermophilus_insularis 13.298723 NA
Ammospermophilus_nelsoni 17.581109 160.42
Atlantoxerus_getulus 13.499111 250.57
desired output:
structure(list(chroma = c(10.96156313, 17.58110922, 13.4991105
), mass..Pantheria. = c(126.57, 160.42, 250.57)), .Names = c("chroma",
"mass..Pantheria."), class = "data.frame", row.names = c("Ammospermophilus_harrisii",
"Ammospermophilus_nelsoni", "Atlantoxerus_getulus"))
chroma mass..Pantheria.
Ammospermophilus_harrisii 10.96156 126.57
Ammospermophilus_nelsoni 17.58111 160.42
Atlantoxerus_getulus 13.49911 250.57
structure(c(126.57, 160.42, 250.57), .Names = c("Ammospermophilus_harrisii",
"Ammospermophilus_nelsoni", "Atlantoxerus_getulus"))
Ammospermophilus_harrisii Ammospermophilus_nelsoni Atlantoxerus_getulus
126.57 160.42 250.57
structure(c(10.96156313, 17.58110922, 13.4991105), .Names = c("Ammospermophilus_harrisii",
"Ammospermophilus_nelsoni", "Atlantoxerus_getulus"))
Ammospermophilus_harrisii Ammospermophilus_nelsoni Atlantoxerus_getulus
10.96156 17.58111 13.49911
Looks like you want to omit rows from your data where chroma or mass..Pantheria are NA. Here's a quick way to do it:
data = data[!is.na(data$chroma) & !is.na(data$mass..Pantheria.), ]
I'm not sure why you are breaking independent and dependent out separately, but after filtering out bad observations is a good time to do it.
Since those are your only two columns, this is equivalent to omitting rows from your data frame that have any NA values, so you can use a shortcut like this:
data = na.omit(data)
If you want to keep a "pristine" copy of your raw data, simply change the name of the result:
data_no_na = na.omit(data)
# or
data = data[!is.na(data$chroma) & !is.na(data$mass..Pantheria.), ]
As to what's wrong with your code, $ is used for extracting columns from a data frame, but you're trying to use it for a named vector (since you've already extracted the columns), which doesn't work. Even then, $ only works with a literal string, you can't use it with a variable. For data frames, you need to use brackets to extract columns stored in variables. For example, the built-in mtcars data has a column called "mpg":
# these work:
mtcars$mpg
mtcars[, "mpg"]
my_col = "mpg"
mtcars[, my_col]
mtcars$my_col ## does not work, need to use brackets!
You can never use $ with row names in a data frame, only column names.

Lag in R dataframe

I have the following sample dataset (below and/or as CSVs here: http://goo.gl/wK57T) which I want to transform as follows. For each person in a household I want to create two new variables OrigTAZ and DestTAZ. It should take the value in TripendTAZ and put that in DestTAZ. For OrigTAZ it should put value of TripendTAZ from the previous row. For the first trip of every person in a household (Tripid = 1) the OrigTAZ = hometaz. For each person in a household, from the second trip OrigTAZ = TripendTAZ_(n-1) and DestTAZ = TripEndTAZ. The sample input and output data are shown below. I tried the suggestions shown here: Basic lag in R vector/dataframe but have not had luck. I am used to doing something like this in SAS.
Any help is appreciated.
TIA,
Krishnan
SAS Code Sample
if Houseid = lag(Houseid) then do;
if Personid = lag(Personid) then do;
DestTAZ = TripendTAZ;
if Tripid = 1 then OrigTAZ = hometaz
else
OrigTAZ = lag(TripendTAZ);
end;
end;
INPUT DATA
Houseid,Personid,Tripid,hometaz,TripendTAZ
1,1,1,45,4
1,1,2,45,7
1,1,3,45,87
1,1,4,45,34
1,1,5,45,45
2,1,1,8,96
2,1,2,8,4
2,1,3,8,2
2,1,4,8,1
2,1,5,8,8
2,2,1,8,58
2,2,2,8,67
2,2,3,8,9
2,2,4,8,10
2,2,5,8,8
3,1,1,7,89
3,1,2,7,35
3,1,3,7,32
3,1,4,7,56
3,1,5,7,7
OUTPUT DATA
Houseid,Personid,Tripid,hometaz,TripendTAZ,OrigTAZ,DestTAZ
1,1,1,45,4,45,4
1,1,2,45,7,4,7
1,1,3,45,87,7,87
1,1,4,45,34,87,34
1,1,5,45,45,34,45
2,1,1,8,96,8,96
2,1,2,8,4,96,4
2,1,3,8,2,4,2
2,1,4,8,1,2,1
2,1,5,8,8,1,8
2,2,1,8,58,8,58
2,2,2,8,67,58,67
2,2,3,8,9,67,9
2,2,4,8,10,9,10
2,2,5,8,8,10,8
3,1,1,7,89,7,89
3,1,2,7,35,89,35
3,1,3,7,32,35,32
3,1,4,7,56,32,56
3,1,5,7,7,56,7
Just proceed through the steps you outlined step-by-step and it isn't so bad.
First I'll read in your data by copying it:
df <- read.csv(file('clipboard'))
Then I'll sort to make sure the data frame is ordered by houseid, then personid, then tripid:
# first sort so that it's ordered by Houseid, then Personid, then Tripid:
df <- with(df, df[order(Houseid,Personid,Tripid),])
Then follow the steps you specified:
# take value in TripendTAZ and put it in DestTAZ
df$DestTAZ <- df$TripendTAZ
# Set OrigTAZ = value from previous row
df$OrigTAZ <- c(NA,df$TripendTAZ[-nrow(df)])
# For the first trip of every person in a household (Tripid = 1),
# OrigTAZ = hometaz.
df$OrigTAZ[ df$Tripid==1 ] <- df$hometaz[ df$Tripid==1 ]
You'll notice that df is then what you're after.

Resources