I have a list of data frames created as follows:
#create the database
vect_date <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14)
vect <- c(48,40,32,36,37,37,20,15,15,24,24,10,10,10)
vect <- as.data.frame(cbind(vect_date, vect))
vect <- vect[order(vect$vect_date),]
#create levels depending on vect$vect value
vect$level <- 1
for(i in 2:length(vect$vect)){vect$level[i] <- ifelse(vect$vect[i]==vect$vect[i-1], vect$level[i- 1],vect$level[i-1]+1)}
#create the list
monotone <- split(vect, f=vect$level)
Now I would like to change the vect$vect values in each of these levels (list elements) depending on the vect$vect value of the subsequent element. I guess this comes down to indexing elements and using for loops, but I don't know how to do that.
For example, I would like to change vect$vect whenever the subsequent level has the value 10. In that case, the vect$vect values of that level should be multiplied by 100, giving:
vect <- c(48,40,32,36,37,37,20,15,15,2400,2400,10,10,10)
Any help would be great!
I think you can use factor in R first to get your levels (this works as long as each value occurs in a single consecutive run, as it does here):
vect_date <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14)
vect <- c(48,40,32,36,37,37,20,15,15,24,24,10,10,10)
vect <- as.data.frame(cbind(vect_date, vect))
vect <- vect[order(vect$vect_date),]
vect$level = factor(vect$vect,levels=unique(vect$vect))
vect$level = as.numeric(vect$level)
So if we want to change the level that comes just before the rows where vect is 10, we can do:
level_tochange = vect$level[vect$vect==10] - 1
level_tochange
[1] 8 8 8
This tells us we need to change the rows with level == 8. Note that I use %in% so that this still works when level_tochange contains more than one distinct level:
rows_tochange = which(vect$level %in% level_tochange)
vect$vect[rows_tochange] = vect$vect[rows_tochange]*100
vect
vect_date vect level
1 1 48 1
2 2 40 2
3 3 32 3
4 4 36 4
5 5 37 5
6 6 37 5
7 7 20 6
8 8 15 7
9 9 15 7
10 10 2400 8
11 11 2400 8
12 12 10 9
13 13 10 9
14 14 10 9
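As an alternative sketch (not part of the answer above), the same result can be obtained from run lengths with rle, which does not depend on each value appearing in only one run. The names runs, next_value, multiplier and result are illustrative.

```r
vect <- c(48, 40, 32, 36, 37, 37, 20, 15, 15, 24, 24, 10, 10, 10)

# Consecutive runs of equal values: runs$values and runs$lengths
runs <- rle(vect)

# Value of the run that follows each run (NA for the last run)
next_value <- c(runs$values[-1], NA)

# Runs followed by a run of 10s get multiplied by 100, all others by 1
multiplier <- ifelse(!is.na(next_value) & next_value == 10, 100, 1)

# Expand the per-run multiplier back to one entry per row
result <- vect * rep(multiplier, runs$lengths)
result
```

Here only the run of 24s is followed by a run of 10s, so only those two entries change.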
I have a large dataset (8,000 obs) and about 16 lists with anywhere from 120 to 2,000 items. Essentially, I want to check to see if any of the observations in the dataset match an item in a list. If there is a match, I want to include a variable indicating the match.
As an example, if I have data that look like this:
dat <- as.data.frame(1:10)
list1 <- c(2:4)
list2 <- c(7,8)
I want to end with a dataset that looks something like this
Obs Var List
1 1
2 2 1
3 3 1
4 4 1
5 5
6 6
7 7 2
8 8 2
9 9
10 10
How do I go about doing this? Thank you!
Here is one way to do it using %in%, multiplying each logical match vector by the index of its list and taking the row-wise maximum. If an observation matches several lists, the last one is taken:
dat <- data.frame(Obs = 1:10)
list_all <- list(c(2:4), c(7,8))
present <- sapply(seq_along(list_all), function(n) (dat$Obs %in% list_all[[n]]) * n)
dat$List <- apply(present, 1, FUN = max)
dat$List[dat$List == 0] <- NA
dat
> dat
Obs List
1 1 NA
2 2 1
3 3 1
4 4 1
5 5 NA
6 6 NA
7 7 2
8 8 2
9 9 NA
10 10 NA
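Another way to sketch this, assuming it is acceptable to take the first matching list rather than the last, is to stack all lists into one lookup table and use match. The name lookup and its columns are illustrative, not from the question.

```r
dat <- data.frame(Obs = 1:10)
list_all <- list(c(2:4), c(7, 8))

# One row per list item, tagged with the index of the list it came from
lookup <- data.frame(
  item = unlist(list_all),
  List = rep(seq_along(list_all), lengths(list_all))
)

# match() returns the position of the first hit, or NA when there is none
dat$List <- lookup$List[match(dat$Obs, lookup$item)]
dat
```

With 16 lists this avoids building a 16-column intermediate matrix, at the cost of reporting only one list per observation.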
Hello, I am very new to programming and data science, and I am trying to work my way through it.
I am trying to assign values to a column in a data frame using a for loop, so that the data frame is divided into ten groups and every row in a group gets the same rank: rows 1 to 10 get rank 1, rows 11 to 20 get rank 2, and so on. The original dimension of the subset data set is 100 * 6.
My data frame looks like this:
[image: Data Frame]
The code I have written is:
x <- round(nrow(subset) / 10)
a=1
for(j in 1:10){
for(i in a:x){
subset[i, "rank"] = j
}
j = j + 1
a = x + 1
x = x * j
}
However, the loop runs infinitely and keeps on adding additional rows to the data frame. I had to manually stop the loop and the resulting dimension of the subset data frame was 17926 * 6.
Please help me understand where I am going wrong in writing the loop.
P.S. subset is a data frame name and not the subset function in R
Thanks in Advance !!
It might be better for you to start working with vectorized calculations instead of loops. This will help you in the future.
For example:
df <- data.frame(x = 1:100)
df$rank <- (df$x-1)%/%10 + 1
df
results in:
x rank
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 2
12 12 2
13 13 2
14 14 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 3
22 22 3
23 23 3
24 24 3
25 25 3
How about something like this:
subset$Rank <- ceiling(as.numeric(rownames(subset))/10)
as.numeric converts each row name into a number; dividing it by 10 and rounding up should give you what you need. This assumes the row names are still the default 1, 2, 3, and so on. Let me know if I've misunderstood.
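As for why the original loop runs away: the upper bound is updated with x = x * j, so it becomes 20, then 60, then 240 and keeps growing, and assigning to row indices beyond 100 appends new rows. The j = j + 1 line has no lasting effect because for resets j at each iteration. A corrected loop might look like the sketch below, where subset_df is a hypothetical stand-in for the 100-row data frame and group_size is an illustrative name.

```r
# Hypothetical stand-in for the asker's 100-row data frame
subset_df <- data.frame(value = seq_len(100))

group_size <- 10
subset_df$rank <- NA_integer_

for (j in seq_len(nrow(subset_df) / group_size)) {
  # Compute both bounds directly from j instead of carrying them between iterations
  first <- (j - 1) * group_size + 1
  last <- j * group_size
  subset_df$rank[first:last] <- j
}

table(subset_df$rank)
```

The vectorized versions above are still shorter and faster; the loop is only meant to show where the bounds went wrong.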
I have 2 atomic vectors:
mcc <- as.character(c(1:10))
ctyc <- as.character(c(2:11))
And I have a data frame:
xmcc <- as.character(c(8:12))
xctyc <- as.character(c(1:4,12))
df <- data.frame(xmcc, xctyc)
colnames(df) <- c("mcc", "ctyc")
mcc ctyc
1 8 1
2 9 2
3 10 3
4 11 4
5 12 12
My desired output is shown below.
The logic is: if the mcc value in the data frame exists in the vector mcc, return it, otherwise return 9999. The same logic applies to column ctyc, with 999 as the fallback.
mcc ctyc mccNew ctycNew
1 8 1 8 999
2 9 2 9 2
3 10 3 10 3
4 11 4 9999 4
5 12 12 9999 999
My attempt:
df$mccNew <- ifelse(df$mcc %in% mcc, df$mcc, "9999")
df$ctycNew <- ifelse(df$ctyc %in% ctyc, df$ctyc, "999")
However, this does not produce the desired output.
We can use match to accomplish this:
match(A, B) produces an index vector whose i-th element is the position in B of the first match for A[i], or NA if there is no match.
So:
> matchedIndex.mcc <- match(df$mcc, mcc)
> matchedIndex.ctyc <- match(df$ctyc, ctyc)
> df$mccNew <- ifelse(!is.na(matchedIndex.mcc), mcc[matchedIndex.mcc], 9999)
> df$ctycNew <- ifelse(!is.na(matchedIndex.ctyc), ctyc[matchedIndex.ctyc], 9999)
> df
mcc ctyc mccNew ctycNew
1 8 1 8 9999
2 9 2 9 2
3 10 3 10 3
4 11 4 9999 4
5 12 12 9999 9999
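To see what match returns before it is passed to ifelse, here is a minimal sketch on plain character vectors. The vector x is a hypothetical stand-in for df$mcc, and idx and new are illustrative names.

```r
mcc <- as.character(1:10)
x <- c("8", "9", "10", "11", "12")  # hypothetical stand-in for df$mcc

# Position of each element of x within mcc, NA where it is absent
idx <- match(x, mcc)
idx

# Keep the matched value where there is one, otherwise fall back to "9999"
new <- ifelse(is.na(idx), "9999", mcc[idx])
new
```

The first three values are found at positions 8, 9 and 10, while "11" and "12" are absent and therefore get the fallback.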
You can use Map to add both variables in a single line like this
df[c("mccNew", "ctycNew")] <- Map(function(x, y) ifelse(x %in% y, as.character(x), "9999"),
df, list(mcc, ctyc))
Here, the left-hand side provides slots with the variable names to add to the data.frame. The right-hand side runs in parallel over the elements of two lists: a list of the data.frame variables and a list of the vectors that you use for checking. Map outputs a list of the same length as the two list arguments, each element containing a vector with one entry per row of df. The as.character call matters when the columns are factors (the data.frame default before R 4.0.0), because ifelse would otherwise return the integer factor codes instead of the values. Note that if your data.frame has more variables, you will want to subset to the variables of interest in the second argument to Map.
This returns
df
mcc ctyc mccNew ctycNew
1 8 1 8 9999
2 9 2 9 2
3 10 3 10 3
4 11 4 9999 4
5 12 12 9999 9999
I have a data frame from which I want to drop the columns whose NA rate is above 70% or in which a single dominant value takes up over 99% of the rows. How can I do that in R?
I find it easy to select rows with a logical vector in the subset function, but how can I do the same for columns? For example, if I write:
isNARateLt70 <- function(column) {
# some code
}
apply(dataframe, 2, isNARateLt70)
then how can I use the resulting vector to subset the data frame?
If you have a data.frame like
dd <- data.frame(matrix(rpois(7*4,10),ncol=7, dimnames=list(NULL,letters[1:7])))
# a b c d e f g
# 1 11 2 5 9 7 6 10
# 2 10 5 11 13 11 11 8
# 3 14 8 6 16 9 11 9
# 4 11 8 12 8 11 6 10
You can subset with a logical vector using one of
mycols <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
dd[mycols]
dd[, mycols]
There's really no need to write a function when we have colMeans (thanks to @MrFlick for the advice to change from colSums()/nrow(); it is shown at the bottom of this answer).
Here's how I would approach your function if you want to use sapply on it later.
> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
z = c(rep(NA, 3), 1, 2))
> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
# x y z
# FALSE TRUE TRUE
Then, to subset your data with the result of that line, it's
> d[sapply(d, isNARateLt70)]
But as mentioned, colMeans works just the same,
> d[colMeans(is.na(d)) <= 0.7]
# y z
# 1 1 NA
# 2 NA NA
# 3 NA NA
# 4 1 1
# 5 1 2
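The answers above cover the NA-rate condition only. A possible sketch for the second condition (a single value taking over 99% of rows) is shown below; it assumes the share is measured against all rows including NAs, which the question does not specify. The extra column w and the names na_rate, dominant_share, keep and kept are illustrative.

```r
d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
                z = c(rep(NA, 3), 1, 2),
                w = rep(7, 5))  # hypothetical constant column

# Fraction of missing values per column
na_rate <- colMeans(is.na(d))

# Share of rows taken by the most frequent non-missing value in each column
dominant_share <- sapply(d, function(col) {
  counts <- table(col)  # table() ignores NA by default
  if (length(counts) == 0) 0 else max(counts) / length(col)
})

# Keep columns that pass both checks
keep <- na_rate <= 0.7 & dominant_share <= 0.99
kept <- d[keep]
names(kept)
```

Column x is dropped for being all NA and w for being constant, leaving y and z.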
Maybe this will help too. The 2 passed to apply() means the function is applied column-wise to the data.frame cars.
> columns <- apply(cars, 2, function(x) {mean(x) > 10})
> columns
speed dist
TRUE TRUE
> cars[1:10, columns]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
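One caveat with apply is that it converts the data frame to a matrix first, which is harmless for the all-numeric cars but coerces everything to character when column types are mixed. A sketch of the same selection with vapply, which works column by column, could look like this; first_rows is an illustrative name.

```r
# cars ships with base R (datasets package): columns speed and dist
columns <- vapply(cars, function(x) mean(x) > 10, logical(1))
columns

# drop = FALSE keeps a data frame even if only one column is selected
first_rows <- cars[1:10, columns, drop = FALSE]
first_rows
```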
I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result is easiest to keep in two columns, as data frame columns normally hold simple atomic vectors. (Unless you want to resort to complex numbers.) head with a negative argument drops that many elements from the tail; try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
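On the follow-up about summing or averaging the group: once the lagged values sit in a matrix, rowSums and rowMeans can be applied directly. The sketch below reuses the embed call from above; lagged, b_sum and b_mean are illustrative names.

```r
df <- data.frame(a = 1:10)

# Column 1 holds the value two rows back, column 2 the value one row back
lagged <- embed(c(NA, NA, df$a), 3)[, c(3, 2)]

# Sum is NA wherever part of the window is missing
df$b_sum <- rowSums(lagged)

# na.rm = TRUE averages whatever part of the window exists (NaN for row 1)
df$b_mean <- rowMeans(lagged, na.rm = TRUE)
df
```

Whether incomplete windows should give NA or a partial result depends on the use case, so both behaviours are shown.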
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8