Apply na.approx on row and output dataframe in R - r

Given the following dataframe:
df1 <- data.frame(Company = c('A','B','C','D','E'), `X1980` = c(NA, 5, 3, 8, 13),
`X1981` = c(20, NA, 23, 11, 29),
`X1982` = c(NA, 32, NA, 41, 42),
`X1983` = c(45, 47, 53, 58, NA))
I would like to replace the NA's with values through interpolation over the rows resulting in the following data frame:
Company 1980 1981 1982 1983
A NA 20 32,5 45
B 5 18,5 32 47
C 3 23 38 53
D 8 11 41 58
E 13 29 42 NA
I tried using na.approx in combination with apply:
df1[-1] <- t(apply(df1[-1], 1, FUN = na.approx))
But this results in the following error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 't': dim(X) must have a positive length
Thanks in advance for any help!!!
EDIT:
I forgot to define how to treat the NA's in the na.approx function.
df1[-1] <- t(apply(df1[-1], 1, na.approx, na.rm=FALSE))
This results in the desired output!

As far as I know, these interpolation methods don't deal with NA's at the start or end of a sequence, which is the case for entries A/X1980 and E/X1983. So na.approx (at least the one from the zoo package) has an optional argument na.rm that lets you choose whether to remove those NA's or not.
By default it is TRUE, so they are removed: the rows for A and E come back with only 3 elements instead of 4 like the other rows, apply can no longer assemble the results into a matrix, and you get that error. To deal with that, simply set the argument to FALSE:
df1[-1] <- t(apply(df1[-1], 1, FUN = zoo::na.approx, na.rm=FALSE))
Output:
> df1
Company X1980 X1981 X1982 X1983
1 A NA 20.0 32.5 45
2 B 5 18.5 32.0 47
3 C 3 23.0 38.0 53
4 D 8 11.0 41.0 58
5 E 13 29.0 42.0 NA
To fill the NA's that remain at the ends, I can help you use the answers from here, if you need me to.
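For reference, here is a minimal base-R sketch of the same row-wise interpolation without zoo, assuming stats::approx is an acceptable stand-in for na.approx. The helper name interp_row is made up for illustration. With rule = 2 the NA's at the ends are filled with the nearest observed value, which may or may not be what you want.

```r
df1 <- data.frame(Company = c('A','B','C','D','E'),
                  X1980 = c(NA, 5, 3, 8, 13),
                  X1981 = c(20, NA, 23, 11, 29),
                  X1982 = c(NA, 32, NA, 41, 42),
                  X1983 = c(45, 47, 53, 58, NA))

# Hypothetical helper: linear interpolation along one row.
# approx() ignores the NA points and interpolates at every position;
# rule = 1 leaves positions outside the observed range as NA,
# rule = 2 fills them with the nearest observed value.
interp_row <- function(x, rule = 1) {
  idx <- seq_along(x)
  approx(idx, x, xout = idx, rule = rule)$y
}

# Interior NA's interpolated, NA's at the ends kept
filled <- df1
filled[-1] <- t(apply(df1[-1], 1, interp_row))

# Same, but NA's at the ends are filled as well
filled_ends <- df1
filled_ends[-1] <- t(apply(df1[-1], 1, interp_row, rule = 2))
```

Each row needs at least two non-NA values for this to work.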

Related

Selecting specific sequence of rows from a matrix

I have a data matrix of order 2000 x 20, and I want to select entire rows from it in a specific sequence: the 1st, 7th, 8th, 14th, 15th, 21st, 22nd, and so on up to the 2000th row.
[1, 7, 8, 14, 15, 21, 22, ...]
Selecting such a sequence manually is very tedious. Is there an alternative way to do this in R? Would a for loop help with this kind of problem?
Using the updated question data, something like:
cumsum(rep(c(1,6), 2000/7))
# [1] 1 7 8 14 15
# ...
#[566] 1981 1982 1988 1989 1995
Since your pattern is +1/+6 up to 2000, you can repeat the two values c(1,6) as many times as sum(c(1,6)) goes into 2000, and then take a cumulative sum.
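As a quick check, the cumulative-sum index can be applied to a stand-in matrix. The matrix mat below is made-up demo data with the dimensions from the question, not the asker's data.

```r
# Stand-in 2000 x 20 matrix (hypothetical demo data)
mat <- matrix(seq_len(2000 * 20), nrow = 2000, ncol = 20)

# Steps of +1 and +6 repeated for every complete block of 7 rows
idx <- cumsum(rep(c(1, 6), 2000 %/% 7))

# Keep only the selected rows, all columns
sub <- mat[idx, ]

head(idx, 7)
# [1]  1  7  8 14 15 21 22
```

Because the index is built first and only then used for subsetting, it is easy to inspect it before touching the real data.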
Try this (row 1 first, then every 7th row together with the row after it):
mat[sort(c(1, k <- seq(7, 2000, by = 7), k + 1)),]
You can define your sequence first, using e.g. sequence, and then subset using [].
n = 2000 %/% 7 # number of complete blocks of 7 rows
s = sequence(nvec = c(1, rep(2,n)), from = c(1, 7*1:n))
# s
# [1] 1 7 8 14 15 21 22 28 29 35 36 42 43 49 50 56 57 63 64 70 71 ...
yourMatrix[s, ]
sequence concatenates several sequences: the i-th one has length nvec[i] and starts at from[i] (the from argument requires R 4.0.0 or later).

Cleaning Data & Association Rules - R

I am trying to tidy the following dataset (linked below) in R and then run association rules on it.
https://www.kaggle.com/fanatiks/shopping-cart
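install.packages("dplyr")
library(dplyr)
install.packages("stringr")
library(stringr)
library(arules) # provides the "transactions" class used below
df <- read.csv("Groceries (2).csv", header = F, stringsAsFactors = F, na.strings=c(""," ","NA"))
# V1 holds a non-letter prefix followed by the first item: split the two apart
temp1<- (str_extract(df$V1, "[a-z]+")) # first run of lowercase letters (the item)
temp2<- (str_extract(df$V1, "[^a-z]+")) # first run of other characters (the prefix)
df<- cbind(temp1,df)
df[2] <- NULL
df[35] <- NULL
View(df)
summary(df)
str(df)
trans <- as(df,"transactions")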
I get the following warning when I run the trans <- as(df,"transactions") line above:
Warning message:
Column(s) 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34 not logical or factor. Applying default discretization (see '? discretizeDF').
summary(trans)
When I run the above code, I get the following:
transactions as itemMatrix in sparse format with
1499 rows (elements/itemsets/transactions) and
1268 columns (items) and a density of 0.01529042
most frequent items:
V5= vegetables V6= vegetables temp1=vegetables V2= vegetables
140 113 109 108
V9= vegetables (Other)
103 28490
The results show the vegetables value as a separate item for each column (V5= vegetables, V6= vegetables, ...) instead of one combined vegetables item, which inflates my number of columns. I am not sure why this is happening.
fit<-apriori(trans,parameter=list(support=0.006,confidence=0.25,minlen=2))
fit<-sort(fit,by="support")
inspect(head(fit))
To coerce to the transactions class, the data frame needs to be made up of factor columns. You have a data frame of characters, hence the warning message. The data requires some further cleaning in order to coerce properly.
I'm not very familiar with the arules package but I believe the read.transactions function may be more useful as it would automatically discard duplicates. I found it easiest to make a binary matrix and use a for loop, but I am sure there is a neater solution.
Continuing on directly from your code:
items <- as.character(unique(unlist(df))) # get all unique items
items <- items[which(str_detect(items, "[a-z]"))] # remove numbers
# one row per transaction, one column per item, all zeros to start with
trans <- matrix(0, nrow = nrow(df), ncol = length(items))
for(i in 1:nrow(df)){
  # mark the items that occur in transaction i
  trans[i,which(items %in% t(df[i,]))] <- 1
}
colnames(trans) <- items
rownames(trans) <- temp2
trans <- as(trans, "transactions")
summary(trans)
Giving
transactions as itemMatrix in sparse format with
1637 rows (elements/itemsets/transactions) and
38 columns (items) and a density of 0.3359965
most frequent items:
vegetables poultry waffles ice cream lunch meat (Other)
1058 582 562 556 555 17588
element (itemset/transaction) length distribution:
sizes
0 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
102 36 8 57 51 51 71 69 63 80 79 58 84 91 72 105 97 87 114 91 82 46 30 7 4 2
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 8.00 14.00 12.77 18.00 26.00
includes extended item information - examples:
labels
1 pork
2 shampoo
3 juice
includes extended transaction information - examples:
transactionID
1 1/1/2000
2 1/1/2000
3 2/1/2000
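The binary-matrix step can be illustrated on a toy basket table using base R only. The data frame baskets and its item names are invented for this sketch and are not taken from the Kaggle file. The point is that each item becomes a single column no matter which input column it appeared in.

```r
# Toy wide-format baskets (hypothetical data): one row per transaction
baskets <- data.frame(V1 = c("milk", "bread", "milk"),
                      V2 = c("bread", NA, "eggs"),
                      V3 = c("milk", NA, NA),
                      stringsAsFactors = FALSE)

# All distinct items across every column; sort() drops the NA's
items <- sort(unique(unlist(baskets, use.names = FALSE)))

# One row per transaction, one column per item, 1 if the item is present
incidence <- t(apply(baskets, 1, function(row) as.integer(items %in% row)))
colnames(incidence) <- items

incidence
```

A duplicated item within a row (milk in the first basket) is counted once, which is the behaviour the loop above has as well.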

Print levels of a factor present within select criteria rather than all levels of the factor in R?

I am working with R version i386 3.1.1 and RStudio 0.99.442.
I have large datasets of tree species that I've collected from 7 plots, each of which are divided into 5 subplots (i.e. 35 distinct subplots). I am trying to get R to run through my dataset and print the species which are present within each plot.
I thought I could use "aggregate" to apply the "levels" function to the Species data column and have it return the Species present for each Plot and Subplot, however it returns the levels of the entire data frame (for 12 species, total) rather than the 3 or 4 species that are actually present in the Subplot.
To provide a reproducible example of what I'm trying to do, we can use the "warpbreaks" dataset that comes with R.
I convert the 'breaks' variable in warpbreaks to a factor to recreate the problem; it thus stands in for my 'species' variable, whereas 'warpbreaks$wool' represents 'plot' and 'warpbreaks$tension' represents 'subplot'.
require(stats)
warpbreaks$breaks = as.factor(warpbreaks$breaks)
aggregate(breaks ~ wool + tension, data = warpbreaks, FUN="levels")
If we look at the warpbreaks data, then for "Plot" A (wool) and "Subplot" L (tension) - the desired script would print the species "26, 30, 54, 25, etc."
breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L
7 51 A L
8 26 A L
9 67 A L
10 18 A M
11 21 A M
12 29 A M
...
Instead, R returns something of this sort, where it is printing ALL of the levels of the factor variable for ALL of the plots:
wool tension breaks.1 breaks.2 breaks.3 breaks.4 breaks.5 breaks...
1 A L 10 12 13 14 15 ...
2 B L 10 12 13 14 15 ...
3 A M 10 12 13 14 15 ...
4 B M 10 12 13 14 15 ...
5 A H 10 12 13 14 15 ...
6 B H 10 12 13 14 15 ...
How do I get it to print only the factor levels that are present within that Plot/Subplot combination? Am I totally off in my use of "aggregate"? I'd imagine this is a relatively easy task for an experienced R user...
First time stackoverflow post - would appreciate any help or nudges towards the right code!
Many kind thanks.
Try FUN=unique rather than FUN=levels. levels will return every level of the factor, as you have surmised already. unique(...) will only return the values that actually occur in each group.
y <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN=unique)
wool tension breaks
1 A L 14, 18, 29, 13, 31, 28, 27, 30
2 B L 15, 4, 17, 9, 19, 23, 10, 26
3 A M 8, 11, 17, 7, 2, 20, 18, 21
4 B M 24, 14, 9, 6, 22, 16, 11, 17
5 A H 21, 11, 12, 8, 1, 25, 16, 5, 14
6 B H 10, 11, 12, 7, 3, 5, 6, 16
NOTE: the breaks column is a little weird, in that each row of that column holds a vector of values instead of a single value (which is what you would expect in a data frame). That is, each cell of the breaks column is NOT a string; it's a vector!
> class(y$wool)
[1] "factor"
> class(y$breaks) # list !!
[1] "list"
> y$breaks[[1]] # first row in breaks
[1] 26 30 54 25 70 52 51 67
Levels: 10 12 13 14 15 16 17 18 19 20 21 24 25 26 27 28 29 30 31 35 36 39 41 42 43 44 51 52 54 67 70
Note that to access the first element of the breaks column, instead of doing y$breaks[1] (like you would with the wool or tension column) you need to do y$breaks[[1]] because of this.
Data frames are not really meant to work like this; a single cell in a dataframe is supposed to have a single value, and most functions will expect a dataframe to conform to this, so just keep this in mind when doing future processing.
If you wanted to convert to a string, use (e.g.) FUN=function (x) paste(unique(x), collapse=', '); then y$breaks will be a column of strings and behaves as normal.
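A short sketch of that string variant on the warpbreaks example, working on a copy (wb, a name chosen here) so the built-in dataset is left untouched:

```r
# Work on a copy so the built-in dataset stays unchanged
wb <- warpbreaks
wb$breaks <- as.factor(wb$breaks)

# Collapse the values present in each wool/tension group into one string
y <- aggregate(breaks ~ wool + tension, data = wb,
               FUN = function(x) paste(unique(x), collapse = ", "))

y$breaks[1]
# [1] "26, 30, 54, 25, 70, 52, 51, 67"
```

Here y$breaks holds one string per row, so y$breaks[1] works like it does for any other column.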

Regroup lines of a data frame for which a column value is inferior to x

I have this data frame :
> df
Z freq proba
1 17 1 0.0033289263
2 18 4 0.0055569026
3 19 2 0.0087878028
4 20 3 0.0132023556
5 21 16 0.0188900561
6 22 12 0.0257995234
7 23 30 0.0337042731
8 24 41 0.0421963455
9 25 56 0.0507149437
10 26 65 0.0586089198
11 27 65 0.0652230449
12 28 93 0.0699913154
13 29 82 0.0725182432
14 30 94 0.0726318551
15 31 72 0.0703990113
16 32 74 0.0661024717
17 33 58 0.0601873020
18 34 66 0.0531896431
19 35 38 0.0456625487
20 36 45 0.0381117389
21 37 27 0.0309498221
22 38 17 0.0244723502
23 39 15 0.0188543771
24 40 13 0.0141629367
25 41 4 0.0103793600
26 42 1 0.0074254435
27 43 2 0.0051886582
28 45 1 0.0023658767
29 46 1 0.0015453804
30 49 2 0.0003792308
Here is my data:
> dput(df)
structure(list(Z = c(17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 45, 46, 49), freq = c(1, 4, 2, 3, 16, 12, 30, 41, 56, 65,
65, 93, 82, 94, 72, 74, 58, 66, 38, 45, 27, 17, 15, 13, 4, 1,
2, 1, 1, 2), proba = c(0.0033289262662263, 0.00555690264007235,
0.00878780282243439, 0.0132023555702843, 0.0188900560866825,
0.0257995234198431, 0.0337042730520012, 0.0421963455163949, 0.0507149437492447,
0.0586089198012906, 0.0652230449359029, 0.0699913153996099, 0.0725182432348992,
0.0726318551493006, 0.0703990113442269, 0.0661024716831246, 0.0601873020200862,
0.0531896430528685, 0.045662548708844, 0.0381117389181843, 0.030949822142559,
0.0244723501557229, 0.01885437705459, 0.0141629366839816, 0.0103793599644779,
0.00742544354411115, 0.00518865818999788, 0.00236587669133322,
0.00154538036835848, 0.000379230768851682)), .Names = c("Z",
"freq", "proba"), row.names = c(NA, -30L), class = "data.frame")
I want to merge each line whose "freq" value is < 5 with the next line, and keep merging while the next line is also < 5.
I don't know if I'm clear enough, so here is the output I expect:
> df2
labels effectifs pi
1 17;20 10 0.03087599
2 21 16 0.01889006
3 22 12 0.02579952
4 23 30 0.03370427
5 24 41 0.04219635
6 25 56 0.05071494
7 26 65 0.05860892
8 27 65 0.06522304
9 28 93 0.06999132
10 29 82 0.07251824
11 30 94 0.07263186
12 31 72 0.07039901
13 32 74 0.06610247
14 33 58 0.06018730
15 34 66 0.05318964
16 35 38 0.04566255
17 36 45 0.03811174
18 37 27 0.03094982
19 38 17 0.02447235
20 39 15 0.01885438
21 40 13 0.01416294
22 41;49 11 0.02728395
I did it with nested while loops, but I find this solution painful and far from optimal.
i <- 1
freqs <- c()
labels <- c()
pi <- c()
while(i < nrow(df)) {
  if (df$freq[i] >= 5) {
    freqs <- c(freqs, df$freq[i])
    labels <- c(labels, df$Z[i])
    pi <- c(pi, df$proba[i])
    i <- i + 1
  }
  else {
    count <- df$freq[i]
    countPi <- df$proba[i]
    k <- i
    j <- i
    while(df$freq[i] < 5 & i < nrow(df)) {
      if (df$freq[i+1] < 5) {
        count <- count + df$freq[i+1]
        countPi <- countPi + df$proba[i+1]
        j <- i + 1
      }
      i <- i + 1
    }
    labels <- c(labels, paste0(df$Z[k], ";", df$Z[j]))
    freqs <- c(freqs, count)
    pi <- c(pi, countPi)
  }
}
df2 <- data.frame(labels, freqs, pi)
I'm sure there is a far better way, maybe with dplyr. If you have a better solution, thanks!
We could use the devel version of data.table, which introduces new functions such as rleid. Here, we convert the data.frame to a data.table (setDT(df)) and create a grouping variable ("gr") from the logical index (freq < 5) using rleid. Because the "Z" column is numeric/integer, we create a character copy of it ("Z1"). Then, grouped by "gr": if "freq" is less than 5 for all elements of the group, we summarise the rows to a single row by taking the first observation of the columns (.SD[1L]), drop the unwanted column (.SD includes "Z1", which would otherwise be duplicated), and append a "Z1" built by pasting the min and max of "Z" for that group; otherwise we leave the group unchanged (else .SD). Finally, we remove the columns we don't need by assigning them to NULL.
library(data.table) #data.table_1.9.5
res <- setDT(df)[, gr:=rleid(freq<5)][, Z1:= as.character(Z)][,
if(all(freq<5)) c(.SD[1L][,-4, with=FALSE],
list(Z1=toString(c(min(Z), max(Z)))))
else .SD, gr][,1:2 :=NULL][]
head(res,3)
# freq proba Z1
#1: 1 0.003328926 17, 20
#2: 16 0.018890056 21
#3: 12 0.025799523 22
Since this is a dplyr question, here is a dplyr solution. First I define a grouping function to build the groups (similar to the rleid function in data.table). After that, the summary is fairly simple.
# grouping function
grouping <- function(condition){
  # calculate runs for grouping
  run <- rle((!condition) * 1:length(condition))
  # revalue
  run$values <- seq_along(run$values)
  # invert to get grouping
  inverse.rle(run)
}
# load dplyr
require(dplyr)
df %>%
  mutate(group = grouping(freq<5)) %>% # add groups
  group_by(group) %>% # group data
  summarize(freq = sum(freq), # sum freq
            proba = sum(proba), # sum proba
            Z = toString(unique(range(Z)))) %>% # rename Z
  mutate(group=NULL) # remove groups
## Source: local data table [22 x 3]
##
## freq proba Z
## 1 10 0.03087599 17, 20
## 2 16 0.01889006 21
## 3 12 0.02579952 22
## 4 30 0.03370427 23
## 5 41 0.04219635 24
## 6 56 0.05071494 25
## 7 65 0.05860892 26
## 8 65 0.06522304 27
## 9 93 0.06999132 28
## 10 82 0.07251824 29
## .. ... ... ...
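The same run-based grouping can also be sketched in base R, without data.table or dplyr. The data frame toy below is a small made-up table with the same Z and freq columns, not the asker's data. A new group starts at every row unless both that row and the previous one have freq < 5.

```r
# Small made-up table with the same structure as the question's data
toy <- data.frame(Z = 17:24, freq = c(1, 4, 2, 3, 16, 12, 4, 1))

low <- toy$freq < 5

# TRUE where a new group starts: always, except inside a run of low rows
starts <- c(TRUE, !(low[-1] & low[-length(low)]))
group <- cumsum(starts)

# Label each group by its first and last Z, and sum the frequencies
labels <- vapply(split(toy$Z, group),
                 function(z) paste(unique(range(z)), collapse = ";"),
                 character(1))
freq <- vapply(split(toy$freq, group), sum, numeric(1))

res <- data.frame(labels = unname(labels), freq = unname(freq),
                  stringsAsFactors = FALSE)
res
```

For this toy input, rows 17 to 20 collapse into one group with freq 10 and rows 23 to 24 into one with freq 5, while 21 and 22 stay on their own.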

create a new column with multiple categories using two columns when they satisfy certain conditions using R

I have a data set with columns "X" (values from 0 to 80) and "Y" (values from 0 to 80). I would like to create a new column "Table". I have 36 tables in mind, in groups of 6. They should be grouped according to:
Tables 1-6: Y 11-20, Tables 7-12: Y 21-30, Tables 13-18: Y 31-40, Tables 19-24: Y 41-50, Tables 25-30: Y 51-60, Tables 31-36: Y 61-70
Table 1: X 21-30 and Tables 7, 13, 19, 25, 31
Table 2: X 31-40 and Tables 8, 14, 20, 26, 32
Table 3: X 41-50 and Tables 9, 15, 21, 27, 33
Table 4: X 51-60 and Tables 10, 16, 22, 28, 34
Table 5: X 61-70 and Tables 11, 17, 23, 29, 35
Table 6: X 71-80 and Tables 12, 18, 24, 30, 36
End Result:
X Y Table
45 13 3
66 59 29
21 70 31
17 66 NA (there is no table for X lower than 21)
Should I be using the If Else function to group the data from the "X" and "Y" into my new "Table", ranging from 1 to 36 or something else? Any help will be appreciated! Thank you!
head(data)
value avg.temp X Y
1 0 6.69 45 13
2 0 6.01 48 14
3 0 7.35 39 15
4 0 5.86 45 15
5 0 6.43 42 16
6 0 5.68 48 16
I think you could use something like this. If your data frame is called df :
df$Table <- NA
df$Table[df$X>=21 & df$X<=30 & df$Y>=11 & df$Y<=20] <- 1
df$Table[df$X>=31 & df$X<=40 & df$Y>=11 & df$Y<=20] <- 2
...
Use math and indexes:
# demo data
x <- data.frame(X = c(45,66,21,17,0,1,21,80,45),Y = c(13,59,70,66,80,11,0,1,27))
# if each GROUP of Y tables was numbered 1-6, aka indexing
x$ytableindex <- ((x$Y-1) - (x$Y-1) %% 10) / 10
# NA if too low
x$ytableindex[x$ytableindex < 1] <- NA
# find lowest table based on Y index
x$ytable <- (0:5*6+1)[x$ytableindex]
# find difference from lowest Y table to arrive at correct table using X
x$xdiff <- floor((x$X - 1) / 10 - 2)
# NA if too low
x$xdiff[x$xdiff < 0] <- NA
# use difference to calculate the correct table, NA's stay NA
x$Table <- x$ytable + x$xdiff
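A more compact sketch of the same idea uses cut() to bin X and Y directly. The helper name table_number is made up for illustration, and it assumes the bins are exactly the 10-wide ranges listed in the question (21-30, ..., 71-80 for X and 11-20, ..., 61-70 for Y).

```r
# Rows from the question's expected result
x <- data.frame(X = c(45, 66, 21, 17), Y = c(13, 59, 70, 66))

# Hypothetical helper: bin X and Y, then combine the two bin numbers.
# cut(..., labels = FALSE) returns the bin number, or NA outside the breaks.
table_number <- function(X, Y) {
  xi <- cut(X, breaks = seq(20, 80, by = 10), labels = FALSE) # 1..6 for (20,30] ... (70,80]
  yi <- cut(Y, breaks = seq(10, 70, by = 10), labels = FALSE) # 1..6 for (10,20] ... (60,70]
  (yi - 1) * 6 + xi
}

x$Table <- table_number(x$X, x$Y)
x$Table
# [1]  3 29 31 NA
```

Values that fall outside either set of breaks come back as NA without any extra handling, since arithmetic with NA stays NA.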
