R subsetting with "[...]" returns wrong values

I am pretty much self-taught in R, so I don't have much expertise.
But when I recently checked some "triple"-subsetted data, I noticed some errors that I can't explain, even after several experimental attempts.
> data <- read.csv("MEC7 table_R.csv", header = TRUE, sep = ";")
> show(data)
Accession Nitrogen.supply Replicate chlorophylle.A chlorophylle.B carotenoids phenols flavanoids carbohydrates proteins
1 HM022 Fix 2 35.71822 13.61773 5.328781 349.0419 116.20386 108.6696 4.602691
2 HM022 Fix 4 72.79944 26.31981 12.419295 324.1432 NA 113.3090 10.127329
3 HM022 Fed 3 40.84556 14.78139 7.844996 424.2241 118.89316 149.3712 8.624055
4 HM022 Fed 4 51.43829 18.34177 10.417405 428.2580 NA 120.8249 71.665152
5 HM022 Fix 3 55.42231 18.73236 5.644060 210.9737 104.85056 391.5803 6.043350
6 HM022 Fed 2 55.77691 18.66710 7.606938 359.6686 117.05009 271.1971 7.368690
7 HM306 Fed 3 51.23863 20.18586 8.830432 444.5283 158.76851 324.4583 2.757425
8 HM306 Fix 4 90.45100 31.74905 14.505800 388.7416 113.75900 250.3901 16.132638
9 HM022 Fix 1 66.61150 22.31275 11.134300 328.1411 152.92800 387.9007 7.821263
10 HM022 Fed 1 67.63950 21.24895 12.076750 483.3238 130.51800 273.0234 6.382024
11 HM306 Fed 2 65.51469 22.45891 13.603800 NA NA NA NA
12 HM306 Fix 2 117.65653 37.67211 20.725213 NA NA NA NA
13 HM306 Fix 3 100.54241 34.42628 20.371192 NA NA 1273.3765 244.340559
14 HM306 Fed 1 58.26609 22.47596 10.582551 317.4561 99.01719 319.1538 5.822906
15 HM306 Fed 1 66.12246 22.63671 13.222049 399.5537 96.86456 437.4982 4.082517
16 HM306 Fed 4 84.31291 29.05635 17.092142 411.3784 75.22140 387.2773 5.593258
> Chla <- data$chlorophylle.A
> Chla022 <- Chla[data$Accession=="HM022"]
> Chla022Fix <- Chla022[data$Nitrogen.supply=="Fix"]
> Chla022Fix
[1] 35.71822 72.79944 55.42231 67.63950 NA NA NA
As you can see, the value from line 9 ("Fix") was confused with the one from line 10 ("Fed"), and similar errors appear in the other columns as well (chlorophylle.B, carotenoids, etc.). The NA values are also unclear to me.
Am I missing something? This is very confusing to me.
Thanks for help in advance :)

Is this what you are looking for?
library(dplyr)
df %>%
  filter(Accession == "HM022" & Nitrogen.supply == "Fix") %>%
  pull(chlorophylle.A)
[1] 35.71822 72.79944 55.42231 66.61150
And in base R you can do the following. Here the prefix form `[[`(...) is equivalent to function(x) x[["chlorophylle.A"]], i.e. it extracts the chlorophylle.A column of the data frame:
`[[`(subset(df, Accession=="HM022" & Nitrogen.supply=="Fix"), "chlorophylle.A")
[1] 35.71822 72.79944 55.42231 66.61150
Or, if you want to stick with your own approach: subset the data frame by rows where Accession and Nitrogen.supply both meet the requirements, and only then extract the chlorophylle.A column. Filtering the already-shortened Chla022 with a logical vector built from the full 16-row data frame is what misaligned your values:
df <- df[df$Accession == "HM022" & df$Nitrogen.supply == "Fix", ]
df <- df$chlorophylle.A
[1] 35.71822 72.79944 55.42231 66.61150
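The misalignment in the original question can be reproduced in miniature: applying a logical index that is longer than the vector it indexes returns NA for the out-of-range positions. A minimal sketch with made-up values:

```r
x   <- c(10, 20, 30, 40)
grp <- c("a", "a", "b", "b")

sub <- x[grp == "a"]             # length 2: 10 20
sub[c(TRUE, FALSE, TRUE, TRUE)]  # 4-long mask applied to a 2-long vector
# -> 10 NA NA : positions 3 and 4 do not exist in sub

# The fix: combine both conditions against the full-length vector instead
x[grp == "a" & c(TRUE, FALSE, TRUE, TRUE)]
# -> 10
```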

Related

Error in .f(.x[[i]], ...) : object 'Review.Text' not found when using select

I am trying to classify review text into positive and negative sentiment, so I have to select the column Review.Text. However, there seems to be a problem with this column, as R does not recognize it. Maybe I did not apply the select function correctly. Does somebody have an idea how to fix this?
reviewscl <- read.csv("C:/Users/Astrid/Documents/Master BWL/Data Mining mit R/R/Präsentation 2/Womens Clothing Reviews3.csv")
reviewscl2 <- as.data.frame(reviewscl)
reviews2 <- reviewscl %>%
  unite("Title", "Review.Text", sep = " ")
reviews2[is.na(reviews2)] <- ""
reviewStars <- as.numeric(reviews2$Rating)
reviews3 <- cbind(reviews2, reviewStars)
reviews_pos <- reviews3 %>%
  filter(reviewStars >= 4) %>%
  select(reviewscl2, Review.Text) %>%
  cbind("Valenz" = 1)
This is the data.frame. I don't know why there are no columns Title and Review.Text as they exist in the csv file.
Rating Recommended.IND Positive.Feedback.Count Division.Name Department.Name Class.Name...................
1 4 1 0 Initmates Intimate Intimates;;;;;;;;;;;;;;;;;;;
2 NA
3 NA
4 NA
5 5 1 6 General Tops Blouses;;;;;;;;;;;;;;;;;;;
6 NA
7 NA
8 NA
9 5 1 0 General Dresses Dresses;;;;;;;;;;;;;;;;;;;
10 NA
11 3 0 14 General Dresses Dresses;;;;;;;;;;;;;;;;;;;
12 5 1 2 General Petite Dresses Dresses;;;;;;;;;;;;;;;;;;;
13 NA
14 NA
15 3 1 1 General Dresses Dresses;;;;;;;;;;;;;;;;;;;
16 NA
17 NA
18 5 1 0 General Tops Blouses;;;;;;;;;;;;;;;;;;;
19 NA
20 NA
21 NA
Error in .f(.x[[i]], ...) : object 'Review.Text' not found
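A likely cause, sketched here with hypothetical toy data (the real file and column set are not shown): unite() merges Review.Text into the new column and removes it by default, which is why the later select(Review.Text) cannot find it; also, inside a pipe select() takes only column names, not a second data frame. The runs of ";;;;" in the printout further suggest the file is semicolon-delimited, so read.csv(..., sep = ";") may be needed. A minimal sketch of a corrected pipeline (Title_full is an invented name):

```r
library(dplyr)
library(tidyr)  # unite() lives in tidyr, not dplyr

# hypothetical stand-in for the CSV contents
reviews <- data.frame(Title       = c("Nice", "Meh"),
                      Review.Text = c("Great top", "Runs small"),
                      Rating      = c(5, 2))

reviews_pos <- reviews %>%
  # keep the source columns with remove = FALSE so Review.Text survives
  unite("Title_full", "Title", "Review.Text", sep = " ", remove = FALSE) %>%
  filter(as.numeric(Rating) >= 4) %>%
  select(Review.Text) %>%   # just the column name, no data-frame argument
  mutate(Valenz = 1)
```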

Extract string between prefix and suffix

I have these columns:
text.NANA text.22 text.32
1 Female RNDM_MXN95.tif No NA
12 Male RNDM_QOS38.tif No NA
13 Female RNDM_WQW90.tif No NA
14 Male RNDM_BKD94.tif No NA
15 Male RNDM_LGD67.tif No NA
16 Female RNDM_AFP45.tif No NA
I want to create a column that contains only the barcode, which starts with RNDM_ and ends with .tif (without including the .tif). The tricky part is getting rid of the gender information that is also in the same column. There is a random number of spaces between the gender information and the RNDM_:
text.NANA text.22 text.32 BARCODE
1 Female RNDM_MXN95.tif No NA RNDM_MXN95
12 Male RNDM_QOS38.tif No NA RNDM_QOS38
13 Female RNDM_WQW90.tif No NA RNDM_WQW90
14 Male RNDM_BKD94.tif No NA RNDM_BKD94
15 Male RNDM_LGD67.tif No NA RNDM_LGD67
16 Female RNDM_AFP45.tif No NA RNDM_AFP45
I made a very poor attempt with this, but it didn't work:
dfrm$BARCODE <- regexpr("RNDM_", dfrm$text.NANA)
# [1] 8 6 9 7 7 8 9 9 8 8 9 9 6 6 7 8 9 8
# attr(,"match.length")
# [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# attr(,"useBytes")
# [1] TRUE
Please help. Thanks!
So you just want to remove the file extension? Use tools::file_path_sans_ext:
dfrm$BARCODE = tools::file_path_sans_ext(dfrm$text.NANA)
If there’s more stuff in front, you can use the following regular expression to extract just the suffix:
dfrm$BARCODE = stringr::str_match(dfrm$text.NANA, '(RNDM_.*)\\.tif')[, 2]
Note that I’m using the {stringr} package here because the base R functions for extracting regex matches (regmatches() and friends) are awkward to use.
I strongly recommend against using strsplit here because it’s underspecified: from reading the code it’s absolutely not clear what the purpose of that code is. Write code that is self-explanatory, not code that requires explanation in a comment.
You can use sapply() together with strsplit() to do it easily, let me show you:
sapply(strsplit(dfrm$text.NANA, "_"), "[", 1)
That should work.
Edit: the line above returns the part before the underscore (the gender), not the barcode. Splitting on runs of spaces and dots and taking the second piece does what you want:
sapply(strsplit(dfrm$text.NANA, "[ .]+"), "[", 2)
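For completeness, the same extraction also works in base R with sub() alone (a sketch; it assumes each value contains exactly one RNDM_...tif token):

```r
x <- c("Female   RNDM_MXN95.tif", "Male RNDM_QOS38.tif")
# capture everything from RNDM_ up to (but excluding) .tif
sub(".*(RNDM_[^.]*)\\.tif.*", "\\1", x)
# -> "RNDM_MXN95" "RNDM_QOS38"
```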

order() function strange behavior

I encountered some strange behavior with the order() function.
There are two datasets (training and testing), created with this code:
train.part <- 0.25
train.ind <- sample.int(n=nrow(newdata), size=floor(train.part*nrow(newdata)), replace=FALSE)
train.set <- newdata[train.ind,]
test.set <- newdata[-train.ind,]
When I try to order train.set with:
train.set <- train.set[order(noise.Y),]
it's all right, but with the second dataset it's not good:
before sorting:
> test.set
noise.Y noise.Rec
1 7.226370 86.23327
2 3.965446 85.24321
3 5.896981 84.70086
4 4.101038 85.51946
5 7.965455 85.46091
6 8.329555 86.83667
8 6.579297 85.59717
9 7.392187 85.51699
10 5.878640 86.95244
...
after sorting:
> test.set<-test.set[order(noise.Y),]
> test.set
noise.Y noise.Rec
2 3.965446 85.24321
4 4.101038 85.51946
11 7.109978 87.44713
...
NA NA NA
NA.1 NA NA
50 17.009351 92.36286
NA.2 NA NA
48 15.452493 92.09277
53 16.514639 91.57661
NA.3 NA NA
...
It is not sorted properly and there are a lot of unexpected NAs.
What's the reason? Thanks!
Works for me with:
test.set <- test.set[order(test.set$noise.Y),]
The problem is that order(noise.Y) does not look inside test.set: R picks up some other noise.Y (for example the column of the full newdata, if it is attached or defined in the workspace). That vector is longer than nrow(test.set), and indexing a data frame with row numbers past its last row produces the all-NA rows you see.
noise.Y noise.Rec
2 3.965446 85.24321
4 4.101038 85.51946
10 5.878640 86.95244
3 5.896981 84.70086
8 6.579297 85.59717
1 7.226370 86.23327
9 7.392187 85.51699
5 7.965455 85.46091
6 8.329555 86.83667
Note that if you want the row names to be consecutive after sorting, you can simply do
row.names(test.set) <- NULL
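The NA rows themselves are easy to reproduce: indexing a data frame with a row number past its last row yields an all-NA row with row name NA, which is exactly what a too-long order() permutation produces. A toy example:

```r
df <- data.frame(noise.Y = c(3, 1, 2), noise.Rec = c(9, 8, 7))
df[c(2, 5), ]   # row 5 does not exist, so the second row is all NA
#    noise.Y noise.Rec
# 2        1         8
# NA      NA        NA
```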

How to prevent extrapolation using na.spline()

I'm having trouble with the na.spline() function in the zoo package. Although the documentation explicitly states that this is an interpolation function, the behaviour I'm getting includes extrapolation.
The following code reproduces the problem:
require(zoo)
vector <- c(NA,NA,NA,NA,NA,NA,5,NA,7,8,NA,NA)
na.spline(vector)
The output of this should be:
NA NA NA NA NA NA 5 6 7 8 NA NA
This would be interpolation of the internal NA, leaving the trailing NAs in place. But, instead I get:
-1 0 1 2 3 4 5 6 7 8 9 10
According to the documentation, this shouldn't happen. Is there some way to avoid extrapolation?
I recognise that in my example, I could use linear interpolation, but this is a MWE. Although I'm not necessarily wed to the na.spline() function, I need some way to interpolate using cubic splines.
This behavior appears to be coming from the stats::spline function, e.g.,
spline(seq_along(vector), vector, xout=seq_along(vector))$y
# [1] -1 0 1 2 3 4 5 6 7 8 9 10
Here is a workaround, using the fact that na.approx strictly interpolates.
replace(na.spline(vector), is.na(na.approx(vector, na.rm=FALSE)), NA)
# [1] NA NA NA NA NA NA 5 6 7 8 NA NA
Edit
As @G.Grothendieck suggests in the comments, another, no doubt more performant, way is:
na.spline(vector) + 0*na.approx(vector, na.rm = FALSE)
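This trick works because na.approx never extrapolates and because NA propagates through arithmetic: 0 * NA is NA, so the addition masks exactly the leading and trailing positions. A quick illustration with plain vectors:

```r
# left: a fully filled vector (as na.spline would return)
# right: zero times a vector that still has NA at the ends (as na.approx returns)
c(1, 2, 3, 4) + 0 * c(NA, 0, 0, NA)
# -> NA  2  3 NA
```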

How to remove extra rows and columns with NA values when importing from csv file using R

I have just started learning R. I'm trying to input data from a .csv file, but R keeps adding extra rows and columns with NA values. Does anyone know why this might be happening? Any advice on removing these NAs would be greatly appreciated. I have used the following code:
>no_col <- max(count.fields("6%AA_comp.csv", sep=","))
>mydata <- read.csv(file="6%AA_comp.csv", fill=TRUE, header=TRUE, col.names = 1:no_col-1)
>mydata
X0 X1 X2 X3 X4
1 206428 152160 122080 111940 NA
2 183620 148300 118820 107260 NA
3 169100 164480 151420 146200 NA
4 179000 135920 107340 93540 NA
5 213820 146640 113040 109140 NA
6 150920 141400 133600 132000 NA
7 185645 154000 124510 128900 NA
8 176102 139100 141000 110300 NA
9 159045 154350 121050 153500 NA
10 198610 161000 119000 105600 NA
11 183100 138900 141500 129550 NA
12 211050 142550 136700 113500 NA
13 167000 150100 120000 102540 NA
14 NA NA NA NA NA
15 NA NA NA NA NA
16 NA NA NA NA NA
Well, data cleansing is always half the job or more. What you can do is read the file as it is and then clean it by indexing only the rows and columns you are interested in; in your case that would be:
mydata <- read.csv(file="6%AA_comp.csv", fill=TRUE, header=TRUE)
mydata <- mydata[1:13, 1:5]
This typically happens when you delete some rows from your csv file and then try to import it again.
If it's a one-off, the easiest solution is to open the csv in Excel and delete all the rows below the last data row.
Addressing the comment below, we can do something like this
NA.Count = function(x)
{
return(sum(is.na(x)))
}
Row.NA.Count = apply(MAT,1,NA.Count)
Idx = Row.NA.Count == ncol(MAT)
MAT = MAT[!Idx,]
where MAT is the imported matrix.
The above code will take care of all the empty rows. You can do a similar thing for the columns.
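The same filter can also be written without a helper function using rowSums() and colSums() (a sketch; MAT here is a small made-up matrix standing in for the imported data):

```r
MAT <- matrix(c(1, NA, 2,
                NA, NA, NA), nrow = 3)  # column-major: row 2 is all NA

keep <- rowSums(is.na(MAT)) < ncol(MAT)   # FALSE only for all-NA rows
MAT  <- MAT[keep, , drop = FALSE]

# and analogously for all-NA columns:
MAT  <- MAT[, colSums(is.na(MAT)) < nrow(MAT), drop = FALSE]
```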
Hope this helps.
