order() function strange behavior - r

I encountered some strange behavior with the order() function.
I have 2 datasets (training and testing), created with this code:
train.part <- 0.25
train.ind <- sample.int(n=nrow(newdata), size=floor(train.part*nrow(newdata)), replace=FALSE)
train.set <- newdata[train.ind,]
test.set <- newdata[-train.ind,]
When I try to order train.set by:
train.set <- newdata[train.ind,]
it works fine, but with the second dataset the result is wrong:
before sorting:
> test.set
noise.Y noise.Rec
1 7.226370 86.23327
2 3.965446 85.24321
3 5.896981 84.70086
4 4.101038 85.51946
5 7.965455 85.46091
6 8.329555 86.83667
8 6.579297 85.59717
9 7.392187 85.51699
10 5.878640 86.95244
...
after sorting:
> test.set<-test.set[order(noise.Y),]
> test.set
noise.Y noise.Rec
2 3.965446 85.24321
4 4.101038 85.51946
11 7.109978 87.44713
...
NA NA NA
NA.1 NA NA
50 17.009351 92.36286
NA.2 NA NA
48 15.452493 92.09277
53 16.514639 91.57661
NA.3 NA NA
...
The rows are not sorted properly and there are a lot of unexpected NAs.
What's the reason? Thanks!

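Works for me if the column is referenced through the data frame, i.e. order(test.set$noise.Y) instead of order(noise.Y):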
test.set <- test.set[order(test.set$noise.Y),]
noise.Y noise.Rec
2 3.965446 85.24321
4 4.101038 85.51946
10 5.878640 86.95244
3 5.896981 84.70086
8 6.579297 85.59717
1 7.226370 86.23327
9 7.392187 85.51699
5 7.965455 85.46091
6 8.329555 86.83667
Note that if you want the rownames to be consecutive after sorting you can simply do
row.names(test.set) <- NULL
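The question does not show where the bare noise.Y comes from, so the following is only a sketch of one plausible cause: a stray vector named noise.Y in the workspace (for example a copy of the full column) that is longer than test.set. The data and the global noise.Y below are made up for illustration. order() then returns positions that do not exist in test.set, and each of those becomes an NA row.

```r
set.seed(1)
newdata <- data.frame(noise.Y   = runif(20, 0, 20),
                      noise.Rec = runif(20, 80, 95))

# Assumed cause: a global vector with the same name as the column (length 20)
noise.Y <- newdata$noise.Y

test.set <- newdata[1:8, ]  # only 8 rows

# order() sees the 20-element global, so it returns positions 1..20;
# positions 9..20 do not exist in test.set and show up as NA rows
bad <- test.set[order(noise.Y), ]

# Referencing the column of test.set itself yields positions 1..8 only
good <- test.set[order(test.set$noise.Y), ]

nrow(bad)                 # 20 rows, although test.set has only 8
sum(is.na(bad$noise.Y))   # 12 of them are NA rows
nrow(good)                # 8 rows, sorted by noise.Y
```

Using with(test.set, order(noise.Y)) is another way to make sure the column of test.set is the one being ordered.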

Related

R Subset "[..]" wrong values

I am pretty much self-taught in R, so I don't have that much expertise.
But when I recently checked some "triple"-subsetted data, I noticed some errors that I can't explain, even after several experimental attempts.
> data <- read.csv("MEC7 table_R.csv", header = TRUE, sep = ";")
> show(data)
Accession Nitrogen.supply Replicate chlorophylle.A chlorophylle.B carotenoids phenols flavanoids carbohydrates proteins
1 HM022 Fix 2 35.71822 13.61773 5.328781 349.0419 116.20386 108.6696 4.602691
2 HM022 Fix 4 72.79944 26.31981 12.419295 324.1432 NA 113.3090 10.127329
3 HM022 Fed 3 40.84556 14.78139 7.844996 424.2241 118.89316 149.3712 8.624055
4 HM022 Fed 4 51.43829 18.34177 10.417405 428.2580 NA 120.8249 71.665152
5 HM022 Fix 3 55.42231 18.73236 5.644060 210.9737 104.85056 391.5803 6.043350
6 HM022 Fed 2 55.77691 18.66710 7.606938 359.6686 117.05009 271.1971 7.368690
7 HM306 Fed 3 51.23863 20.18586 8.830432 444.5283 158.76851 324.4583 2.757425
8 HM306 Fix 4 90.45100 31.74905 14.505800 388.7416 113.75900 250.3901 16.132638
9 HM022 Fix 1 66.61150 22.31275 11.134300 328.1411 152.92800 387.9007 7.821263
10 HM022 Fed 1 67.63950 21.24895 12.076750 483.3238 130.51800 273.0234 6.382024
11 HM306 Fed 2 65.51469 22.45891 13.603800 NA NA NA NA
12 HM306 Fix 2 117.65653 37.67211 20.725213 NA NA NA NA
13 HM306 Fix 3 100.54241 34.42628 20.371192 NA NA 1273.3765 244.340559
14 HM306 Fed 1 58.26609 22.47596 10.582551 317.4561 99.01719 319.1538 5.822906
15 HM306 Fed 1 66.12246 22.63671 13.222049 399.5537 96.86456 437.4982 4.082517
16 HM306 Fed 4 84.31291 29.05635 17.092142 411.3784 75.22140 387.2773 5.593258
> Chla <- data$chlorophylle.A
> Chla022 <- Chla[data$Accession=="HM022"]
> Chla022Fix <- Chla022[data$Nitrogen.supply=="Fix"]
> Chla022Fix
[1] 35.71822 72.79944 55.42231 67.63950 NA NA NA
As you can see, the value from row 9 ("Fix") was replaced by the one from row 10 ("Fed"), and similar errors appear in other columns as well (chlb, carotenoids, etc.). I also don't understand where the NA values come from.
Am I missing something? This is very disturbing for me.
Thanks for help in advance :)
Is this what you are looking for?
library(dplyr)
df %>%
filter(Accession=="HM022" & Nitrogen.supply=="Fix") %>%
pull(chlorophylle.A)
[1] 35.71822 72.79944 55.42231 66.61150
In base R you can use the following solution. Here `[[`(x, "chlorophylle.A") is equivalent to x[["chlorophylle.A"]], i.e. it extracts the chlorophylle.A column of your data frame.
`[[`(subset(df, Accession=="HM022" & Nitrogen.supply=="Fix"), "chlorophylle.A")
[1] 35.71822 72.79944 55.42231 66.61150
Or, in case you want to stick with your own code: first subset the data set to the rows where the variables Accession and Nitrogen.supply meet both requirements, then extract the desired column, chlorophylle.A:
df <- df[df$Accession == "HM022" & df$Nitrogen.supply == "Fix", ]
df <- df$chlorophylle.A
[1] 35.71822 72.79944 55.42231 66.61150
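Why the step-by-step subsetting in the question goes wrong can be seen on a small made-up data frame (the values below are illustrative, not the original data). After the first subset the vector is shorter than the data frame, but the second logical condition is still computed on the full data frame, so its positions no longer line up, and TRUE values beyond the end of the short vector return NA.

```r
data <- data.frame(
  Accession       = c("HM022", "HM306", "HM022", "HM022", "HM306", "HM306"),
  Nitrogen.supply = c("Fix",   "Fix",   "Fed",   "Fix",   "Fed",   "Fix"),
  chlorophylle.A  = c(10, 20, 30, 40, 50, 60)
)

# First subset: only the 3 HM022 values remain (10, 30, 40)
Chla022 <- data$chlorophylle.A[data$Accession == "HM022"]

# This mask still has 6 elements, one per row of the full data frame
mask <- data$Nitrogen.supply == "Fix"

# Misaligned: picks positions 1, 2, 4, 6 of a length-3 vector
wrong <- Chla022[mask]

# Aligned: both conditions are evaluated on the same 6 rows
right <- data$chlorophylle.A[data$Accession == "HM022" &
                             data$Nitrogen.supply == "Fix"]

wrong  # 10 30 NA NA  (30 belongs to a "Fed" row)
right  # 10 40
```

The same mechanism explains the original output: the HM022 vector has 8 elements, while the "Fix" mask has 16.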

Create column from data on dynamic number of columns depending on availability in R

Given an uncertain number of columns containing source values for the same variable, I would like to create a column that holds the final value to be selected, depending on source importance and availability.
Reproducible data:
set.seed(123)
actuals = runif(10, 500, 1000)
get_rand_vector <- function(){return (runif(10, 0.95, 1.05))}
get_na_rand_ixs <- function(){return (round(runif(5,0,10),0))}
df = data.frame("source_1" = actuals*get_rand_vector(),
"source_2" = actuals*get_rand_vector(),
"source_n" = actuals*get_rand_vector())
df[["source_1"]][get_na_rand_ixs()] <- NA
df[["source_2"]][get_na_rand_ixs()] <- NA
df[["source_n"]][get_na_rand_ixs()] <- NA
My manual solution is as follows:
df$available <- ifelse(
!is.na(df$source_1),
df$source_1,
ifelse(
!is.na(df$source_2),
df$source_2,
df$source_n
)
)
Given the desired result of:
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
How could I automatically iterate over the available sources to set the data to be considered? Given in some cases n_sources could be 1,2,3..,7 and priority follows the natural order (1 > 2 >..)
Once you have all of the candidate vectors in order and in an appropriate data structure (e.g., data.frame or matrix), you can use apply to apply a function over the rows. In this case, we just look for the first non-NA value. Thus, after the first block of code above, you only need the following line:
df$available <- apply(df, 1, FUN = function(x) x[which(!is.na(x))[1]])
coalesce() from dplyr is designed for this:
library(dplyr)
df %>%
mutate(available = coalesce(!!!.))
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
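If an extra dependency is not wanted, the same first-non-NA logic can be sketched in base R with Reduce(). This is only an illustration on made-up numbers; it assumes that all source columns share the prefix "source_" and already appear in priority order.

```r
df <- data.frame(
  source_1 = c(NA, NA, 3, NA),
  source_2 = c(NA, 5, 6, NA),
  source_n = c(NA, 8, 9, 1)
)

# Pick the source columns by name so other columns are left out
source_cols <- grep("^source_", names(df), value = TRUE)

# Walk through the columns left to right, keeping the value found so far
# and filling in from the next column only where it is still NA
df$available <- Reduce(function(a, b) ifelse(is.na(a), b, a), df[source_cols])

df$available  # NA 5 3 1
```

Because the columns are selected by pattern, the same line works whether there are 1, 3 or 7 sources.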

how to declare a global variable within a for loop and why is the else statement read as unexpected?

I have datasets that have sulfate and nitrate columns in them. Depending on what the user chooses, either sulfate mean or nitrate mean is returned. I have a for loop and within it I have an IF and ELSE statement to sort this out. The following error arises when attempting to compile data.frame(datada,vec1):
"Error in data.frame(datada, vec1) : object 'datada' not found"
Also, the else statement is considered unexpected. The following error is given:
"Error: unexpected 'else' in " else"
complete <- function(directory,pollutant = "sulfate", id = 1:332) {
datada <- id
filelist <- list.files(path = directory, pattern = ".csv", full.names = TRUE)
vec <- numeric()
vec1 <- numeric()
vec2 <- numeric()
for(i in datada) {
if (pollutant == "sulfate"){
data <- read.csv(filelist[i])
vec1<- c(vec1, colMeans(data$sulfate,na.rm = TRUE )
}
data.frame(datada,vec1) #datada is not "found"
else (pollutant == "nitrate"){ #else is "unexpected"
data <- read.csv(filelist[i])
vec2<- c(vec2, colMeans(data$sulfate,na.rm = TRUE )
}
data.frame(datada,vec2)
}
Here is what one dataset looks like:
Date sulfate nitrate ID
1 2001-01-01 NA NA 2
2 2001-01-02 NA NA 2
3 2001-01-03 NA NA 2
4 2001-01-04 NA NA 2
5 2001-01-05 NA NA 2
6 2001-01-06 NA NA 2
7 2001-01-07 NA NA 2
8 2001-01-08 NA NA 2
9 2001-01-09 NA NA 2
10 2001-01-10 NA NA 2
11 2001-01-11 NA NA 2
12 2001-01-12 NA NA 2
13 2001-01-13 NA NA 2
14 2001-01-14 NA NA 2
15 2001-01-15 NA NA 2
16 2001-01-16 NA NA 2
17 2001-01-17 NA NA 2
18 2001-01-18 NA NA 2
19 2001-01-19 2.30 0.699 2
20 2001-01-20 NA NA 2
21 2001-01-21 NA NA 2
22 2001-01-22 NA NA 2
23 2001-01-23 NA NA 2
24 2001-01-24 NA NA 2
25 2001-01-25 2.19 4.970 2
It's expected to return something like this:
datada vec
1 1 117
2 3 243
3 5 402
4 7 442
5 9 275
Generated by the data.frame(datada,vec1)
Unless you want to manipulate environment objects, the easiest thing to do is to declare your variable outside the function and use <<- form of assignment inside the function.
datada <- NULL
...
complete <- function(directory,pollutant = "sulfate", id = 1:332) {
datada <<- id
...
}
I have no idea why datada is not found - when I tried a simplified version of the function on my system it seems to work fine.
As to the else -- an else must come directly after the end of the if's statement. It's not expected because you placed data.frame(datada,vec1) before it. If you put that line into the {}, everything should be fine.
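A minimal sketch of a placement the parser accepts is shown below; the function name describe_pollutant is made up for illustration. Note also that a bare else cannot take a condition, so a second test has to be written as else if.

```r
describe_pollutant <- function(pollutant) {
  if (pollutant == "sulfate") {
    "using the sulfate column"
  } else if (pollutant == "nitrate") {  # 'else' follows the closing brace directly
    "using the nitrate column"
  } else {
    stop("Unknown pollutant")
  }
}

describe_pollutant("nitrate")
```

Nothing may stand between the closing brace of the if block and else; at the top level of a script, else even has to be on the same line as that closing brace.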
But generally speaking your code is unnecessarily complex, plus it doesn't actually return anything.
Try something like this:
complete <- function(directory, pollutant = "sulfate", id = 1:332) {
  datada <- id
  filelist <- list.files(path = directory, pattern = ".csv", full.names = TRUE)
  if (!(pollutant %in% c("sulfate", "nitrate"))) stop("Unknown pollutant")
  lapply(filelist, function(x) {
    data <- read.csv(x)
    # mean() rather than colMeans(): a single column is a vector, not a 2-D object
    mean(data[[pollutant]], na.rm = TRUE)
  })
}
This will output a list where each element is the mean of the chosen pollutant in one of the files. You could replace lapply with sapply, which will give you a numeric vector instead of a list.
(note I couldn't test it because I don't have the dataset, so there may be some errors here)
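For a result shaped like the expected table (one row per id), a self-contained sketch could look like the following. The function name pollutant_means, the two small CSV files and their values are all made up for illustration, and the code assumes that the alphabetically sorted file list lines up with the ids.

```r
# Hypothetical helper: one mean per requested file, returned as a data frame
pollutant_means <- function(directory, pollutant = "sulfate", id = 1:332) {
  if (!(pollutant %in% c("sulfate", "nitrate"))) stop("Unknown pollutant")
  filelist <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE)
  # Assumption: the i-th file in sorted order belongs to id i
  means <- vapply(filelist[id], function(path) {
    mean(read.csv(path)[[pollutant]], na.rm = TRUE)
  }, numeric(1), USE.NAMES = FALSE)
  data.frame(id = id, mean = means)
}

# Two tiny made-up files in a temporary directory
demo_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(NA, 2, 4), nitrate = c(1, NA, 3)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(1, 3, NA), nitrate = c(NA, NA, 5)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

result <- pollutant_means(demo_dir, "sulfate", id = 1:2)
print(result)
```

Everything is kept in local variables and returned as the function's value, so no global assignment with <<- is needed.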

Error in .f(.x[[i]], ...) : object 'Review.Text' not found when using select

I am trying to classify review text into positive and negative sentiment. Therefore I have to select the column Review.Text. However there seems to be a problem with this column as R does not recognize it. Maybe I did not apply the function "select" right. Does somebody have an idea how to fix the issue?
reviewscl <- read.csv("C:/Users/Astrid/Documents/Master BWL/Data Mining mit R/R/Präsentation 2/Womens Clothing Reviews3.csv")
reviewscl2 <- as.data.frame(reviewscl)
reviews2 <- reviewscl %>%
unite("Title", "Review.Text", sep=" ")
reviews2[is.na(reviews2)] <- ""
reviewStars <- as.numeric(reviews2$Rating)
reviews3 <- cbind(reviews2, reviewStars)
reviews_pos <- reviews3 %>%
filter(reviewStars>=4) %>%
select(reviewscl2,Review.Text) %>%
cbind("Valenz"=1)
This is the data.frame. I don't know why there are no columns Title and Review.Text as they exist in the csv file.
Rating Recommended.IND Positive.Feedback.Count Division.Name Department.Name Class.Name...................
1 4 1 0 Initmates Intimate Intimates;;;;;;;;;;;;;;;;;;;
2 NA
3 NA
4 NA
5 5 1 6 General Tops Blouses;;;;;;;;;;;;;;;;;;;
6 NA
7 NA
8 NA
9 5 1 0 General Dresses Dresses;;;;;;;;;;;;;;;;;;;
10 NA
11 3 0 14 General Dresses Dresses;;;;;;;;;;;;;;;;;;;
12 5 1 2 General Petite Dresses Dresses;;;;;;;;;;;;;;;;;;;
13 NA
14 NA
15 3 1 1 General Dresses Dresses;;;;;;;;;;;;;;;;;;;
16 NA
17 NA
18 5 1 0 General Tops Blouses;;;;;;;;;;;;;;;;;;;
19 NA
20 NA
21 NA
Error in .f(.x[[i]], ...) : object 'Review.Text' not found

Fill data frame by column with for loop

I created an empty data frame with 11 columns and 15 rows and subsequently named the columns.
L_df <- data.frame(matrix(ncol = 11, nrow = 15))
names(L_df) <- paste0("L_por", 0:10)
w <- c(0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2, 2.2, 2.4, 2.6, 2.8, 3)
wu <- 0
L <- 333.7
pm <- c(2600, 2574, 2548, 2522, 2496, 2470, 2444, 2418, 2392, 2366, 2340)
The data frame looks like this:
head(L_df)
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7 L_por8 L_por9 L_por10
1 NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA
Now, I would like to fill the data frame by column, based on a formula. I tried to express this with a nested for loop:
for (i in 1:ncol(L_df)) {
pm_tmp <- pm[i]
col_tmp <- colnames(L_df)[i]
for (j in 1:nrow(L_df)) {
w_tmp <- w[j]
L_por_tmp <- pm_tmp*L*((w_tmp-wu)/100)
col_tmp[j] <- L_por_tmp
}
}
For each column, I iterate over a predefined vector pm of length 11. For each row, I iterate over a predefined vector w of length 15 (repeats each column).
Example: First, select pm[1] for the first column. Second, select w[j] for each row in the first column. Store the result of the formula in L_por_tmp and use it to fill the first column from row 1 to row 15. The whole procedure should start all over again for the second column (with pm[2]), with w[j] for each row, and so on. wu and L are fixed in the formula.
R executes the code without an error. When I check the tmp values, they are correct. However, the data frame remains empty: L_df does not get filled. I would like to solve this with a loop, but if you have other solutions, I am happy to hear them! I get the impression there might be a smoother way of doing this. Cheers!
Solution
L_df <- data.frame(sapply(pm, function(x) x * L * ((w - wu) / 100)))
names(L_df) <- c("L_por0", "L_por1", "L_por2", "L_por3", "L_por4", "L_por5",
"L_por6", "L_por7", "L_por8", "L_por9", "L_por10")
L_df
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7
1 1735.24 1717.888 1700.535 1683.183 1665.830 1648.478 1631.126 1613.773
2 3470.48 3435.775 3401.070 3366.366 3331.661 3296.956 3262.251 3227.546
3 5205.72 5153.663 5101.606 5049.548 4997.491 4945.434 4893.377 4841.320
4 6940.96 6871.550 6802.141 6732.731 6663.322 6593.912 6524.502 6455.093
5 8676.20 8589.438 8502.676 8415.914 8329.152 8242.390 8155.628 8068.866
6 10411.44 10307.326 10203.211 10099.097 9994.982 9890.868 9786.754 9682.639
7 12146.68 12025.213 11903.746 11782.280 11660.813 11539.346 11417.879 11296.412
8 13881.92 13743.101 13604.282 13465.462 13326.643 13187.824 13049.005 12910.186
9 15617.16 15460.988 15304.817 15148.645 14992.474 14836.302 14680.130 14523.959
10 17352.40 17178.876 17005.352 16831.828 16658.304 16484.780 16311.256 16137.732
11 19087.64 18896.764 18705.887 18515.011 18324.134 18133.258 17942.382 17751.505
12 20822.88 20614.651 20406.422 20198.194 19989.965 19781.736 19573.507 19365.278
13 22558.12 22332.539 22106.958 21881.376 21655.795 21430.214 21204.633 20979.052
14 24293.36 24050.426 23807.493 23564.559 23321.626 23078.692 22835.758 22592.825
15 26028.60 25768.314 25508.028 25247.742 24987.456 24727.170 24466.884 24206.598
L_por8 L_por9 L_por10
1 1596.421 1579.068 1561.716
2 3192.842 3158.137 3123.432
3 4789.262 4737.205 4685.148
4 6385.683 6316.274 6246.864
5 7982.104 7895.342 7808.580
6 9578.525 9474.410 9370.296
7 11174.946 11053.479 10932.012
8 12771.366 12632.547 12493.728
9 14367.787 14211.616 14055.444
10 15964.208 15790.684 15617.160
11 17560.629 17369.752 17178.876
12 19157.050 18948.821 18740.592
13 20753.470 20527.889 20302.308
14 22349.891 22106.958 21864.024
15 23946.312 23686.026 23425.740
Explanation
The sapply() function can be used to iterate over vectors in a way that is more idiomatic for R programming. We iterate over pm and apply your formula once per element; since R is vectorised over w, each call creates a vector of length 15 (so 11 vectors of length 15). Wrapping the result in data.frame() returns the data frame you want, and then we add the column names.
NOTE: Applying functions to every element of a vector using an apply() family function has some different implications than iterating using for loops. In your case, I think sapply() is easier and more understandable. For more information on when you need a loop or when something like apply is better, see for example this discussion from Hadley Wickham's Advanced R book.
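You made just one small mistake and were almost there: col_tmp holds only the column name, so col_tmp[j] <- L_por_tmp changes that character vector and never touches L_df. Here is your loop, edited: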
for (i in 1:ncol(L_df)) {
pm_tmp <- pm[i]
col_tmp <- colnames(L_df)[i]
for (j in 1:nrow(L_df)) {
w_tmp <- w[j]
L_por_tmp <- pm_tmp*L*((w_tmp-wu)/100)
L_df[j, col_tmp] <- L_por_tmp  # assign into the data frame by row index and column name
}
}
Output:
Printing just the first few rows:
L_df
L_por0 L_por1 L_por2 L_por3 L_por4 L_por5 L_por6 L_por7 L_por8 L_por9 L_por10
1 1735.24 1717.888 1700.535 1683.183 1665.830 1648.478 1631.126 1613.773 1596.421 1579.068 1561.716
2 3470.48 3435.775 3401.070 3366.366 3331.661 3296.956 3262.251 3227.546 3192.842 3158.137 3123.432
3 5205.72 5153.663 5101.606 5049.548 4997.491 4945.434 4893.377 4841.320 4789.262 4737.205 4685.148
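As a further loop-free option, the whole table is an outer product of the row vector and the column vector, so outer() can build it in one step. This is a sketch using the values from the question; the name L_df2 is made up so that it does not overwrite L_df.

```r
w  <- c(0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2, 2.2, 2.4, 2.6, 2.8, 3)
wu <- 0
L  <- 333.7
pm <- seq(2600, 2340, by = -26)  # same 11 values as in the question

# outer() multiplies every (w - wu) with every pm: rows follow w, columns follow pm
L_mat <- outer(w - wu, pm) * L / 100

L_df2 <- as.data.frame(L_mat)
names(L_df2) <- paste0("L_por", 0:10)

dim(L_df2)   # 15 rows, 11 columns
L_df2[1, 1]  # 0.2 * 2600 * 333.7 / 100 = 1735.24
```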

Resources