Assign values from same row in R

I want to assign the value of B to A in the same row, but only when A = 1. This is what I have done so far:
Data frame:
df <- data.frame('A' = c(1,1,2,5,4,3,1,2), 'B' = c(100,200,200,200,100,200,100,200))
A B
1 1 100
2 1 200
3 2 200
4 5 200
5 4 100
6 3 200
7 1 100
8 2 200
My attempt:
df$A[df$A == 1] <- df$B
Output:
A B
1 100 100
2 200 200
3 2 200
4 5 200
5 4 100
6 3 200
7 200 100
8 2 200
As you can see, rows 1 and 2 do what they are supposed to. Row 7, however, does not: it takes the B value from row 3, because the replacement values are pulled sequentially from the start of df$B rather than from the matching rows.
My question: how do I assign values taken from the same row?

Use:
df$A[df$A == 1] <- df$B[df$A == 1]
You need to apply the same index to both the column being replaced and the column that holds the replacements; that way each replaced element gets the B value from its own row.
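If you prefer, the same row-wise rule can also be sketched with base R's ifelse (an alternative illustration, not from the original answer):
# Rebuild the original data
df <- data.frame(A = c(1,1,2,5,4,3,1,2),
                 B = c(100,200,200,200,100,200,100,200))
# ifelse works row by row, so each A == 1 is replaced by the B value of the same row
df$A <- ifelse(df$A == 1, df$B, df$A)
df$A
# [1] 100 200   2   5   4   3 100   2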

Related

How to determine when a change in value occurs in R

I am following this example from stack overflow: Identifying where value changes in R data.frame column
There are two columns: ind and value. How do I identify the 'ind' at which 'value' increases by 100?
For example,
Value increases by 100 at ind = 4.
df <- data.frame(ind=1:10,
value=as.character(c(100,100,100,200,200,200,300,300,400,400)), stringsAsFactors=F)
df
ind value
1 1 100
2 2 100
3 3 100
4 4 200
5 5 200
6 6 200
7 7 300
8 8 300
9 9 400
10 10 400
I tried this but it doesn't work:
miss <- function(x) ifelse(is.finite(x),x,NA)
value_xx =miss(min(df$ind[df$value[1:length(df$value)] >= 100], Inf, na.rm=TRUE))
Like this, taking the difference of consecutive (numeric) values and padding with FALSE so the logical index lines up with the rows:
df$ind[c(FALSE, diff(as.numeric(df$value)) == 100)]
You can use diff to get the difference between consecutive values and take the indices where the difference is greater than or equal to 100. Add 1 to the index, since diff returns a vector that is one element shorter than the original. Because value is stored as character, convert it to numeric first:
df$ind[which(diff(as.numeric(df$value)) >= 100) + 1]
#[1] 4 7 9
In dplyr, you can use lag to get the previous value; since value is stored as character, convert it to numeric for the comparison:
library(dplyr)
df %>% filter(as.numeric(value) - lag(as.numeric(value)) >= 100)
# ind value
#1 4 200
#2 7 300
#3 9 400
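For reference, a similar approach could be sketched with data.table's shift (this assumes the data.table package and is not part of the original answers):
library(data.table)
dt <- as.data.table(df)
dt[, value := as.numeric(value)]      # value was created as character
dt[value - shift(value) >= 100, ind]  # shift() lags by one row; the NA produced for row 1 is dropped
# [1] 4 7 9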

How to subset a data.frame according to the values of last two rows?

###the original data
df1 <- data.frame(a=c(2,2,5,5,7), b=c(1,5,4,7,6))
df2 <- data.frame(a=c(2,2,5,5,7,7), b=c(1,5,4,7,6,3))
When the values of column a in the last two rows are not equal (here the 4th row differs from the 5th row, i.e. 5 != 7), I want to subset the last row only.
#input
> df1
a b
1 2 1
2 2 5
3 5 4
4 5 7
5 7 6
#output
> df1
a b
1 7 6
When the values of column a in the last two rows are equal (here the 5th row equals the 6th row, i.e. 7 == 7), I want to subset the last two rows.
#input
> df2
a b
1 2 1
2 2 5
3 5 4
4 5 7
5 7 6
6 7 3
#output
> df2
a b
1 7 6
2 7 3
You can write a function that checks the values of column a in the last two rows:
return_rows <- function(data) {
  n <- nrow(data)
  if (data$a[n] == data$a[n - 1])
    tail(data, 2)   # last two values equal: keep both rows
  else
    tail(data, 1)   # otherwise keep only the last row
}
return_rows(df1)
# a b
#5 7 6
return_rows(df2)
# a b
#5 7 6
#6 7 3
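The same check can also be written inline, without defining a function (a sketch, not from the original answer):
n <- nrow(df2)
tail(df2, if (df2$a[n] == df2$a[n - 1]) 2 else 1)
#   a b
# 5 7 6
# 6 7 3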
Try it this way:
library(tidyverse)
df1 %>%
  filter(a == last(a))
#   a b
# 5 7 6
df2 %>%
  filter(a == last(a))
#   a b
# 5 7 6
# 6 7 3
We can use subset from base R
subset(df1, a == a[length(a)])
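For example, applied to both inputs:
subset(df1, a == a[length(a)])
#   a b
# 5 7 6
subset(df2, a == a[length(a)])
#   a b
# 5 7 6
# 6 7 3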

Sort rows of data frame by shifting the rows so that the maximum value is on the top

I have a data frame like the one below, whose values need to be sorted.
Name Bin Value
a 1 10
a 2 1000
a 3 1
a 4 100
b 1 20
b 2 2
b 3 200
b 4 2000
I want the maximum value to go to the top while keeping the relative order of the values, so that the new order looks like the one below.
Name Bin Value
a 1 1000
a 2 1
a 3 100
a 4 10
b 1 2000
b 2 20
b 3 2
b 4 200
It is not just about bringing the maximum Value to the top; the whole sequence of Value needs to be shifted along with the maximum, so that, for example, the 1 always stays directly below the 1000 for Name a in both the old and the new data frame.
Define a function which takes a vector and rotates it so that the maximum moves to the top, with the values that came before the maximum wrapping around to the bottom. Then use ave to apply that function to Value within each Name.
max2top <- function(x) {
  wx <- which.max(x) - 1                              # how many values come before the maximum
  if (wx == 0) x else c(tail(x, -wx), head(x, wx))    # rotate so the maximum is first
}
transform(DF, Value = ave(Value, Name, FUN = max2top))
giving
Name Bin Value
1 a 1 1000
2 a 2 1
3 a 3 100
4 a 4 10
5 b 1 2000
6 b 2 20
7 b 3 2
8 b 4 200
Note
The input in reproducible form:
Lines <- "Name Bin Value
a 1 10
a 2 1000
a 3 1
a 4 100
b 1 20
b 2 2
b 3 200
b 4 2000"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE)
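A possible dplyr equivalent of the ave call, reusing max2top (a sketch that assumes the dplyr package; not part of the original answer):
library(dplyr)
DF %>%
  group_by(Name) %>%                  # rotate Value within each Name separately
  mutate(Value = max2top(Value)) %>%
  ungroup()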

Deleting groups that don't appear in every time period and data frame

I am cleaning data that has multiple time periods in each data frame, across multiple data frames. Each data frame holds one year of data. I want to delete groups that do not appear in every time period within a data frame, and also delete groups that do not appear in every data frame. In other words, I want to keep only the groups that exist in every time period of every data frame. I created example data with an ID, a time variable, and two variables that represent my data. My real data has more data frames, IDs, groups and variables.
t<-c(1,1,2,2,3,3,3,4,4,4)
id<-c(200,300,200,300,100,200,300,200,300,400)
x1<-rnorm(1:10)
x2<-rnorm(1:10)
df<-data.frame(id,t,x1,x2)
t<-c(1,1,1,2,2,3,3,3,4,4)
id<-c(200,300,400,200,300,200,300,400,200,300)
x1<-rnorm(1:10)
x2<-rnorm(1:10)
df2<-data.frame(id,t,x1,x2)
id<-c(200,300,200,300,600,200,300,100,200,300)
t<-c(1,1,2,2,2,3,3,4,4,4)
x1<-rnorm(1:10)
x2<-rnorm(1:10)
df3<-data.frame(id,t,x1,x2)
rb<-rbind(df,df2,df3)
rb
cb<-cbind(df,df2,df3)
cb
id t x1 x2 id t x1 x2 id t x1 x2
1 200 1 0.37223136 -0.04918183 200 1 0.6489171399 -0.1324335 200 1 -0.41387676 -0.4566678425
2 300 1 -0.22062416 0.05150952 300 1 -0.3669090613 3.0826144 300 1 0.48237987 -0.0325861333
3 200 2 0.32912208 1.03922999 400 1 0.9347859735 0.1026632 200 2 -0.31308242 -0.3021501845
4 300 2 -0.18172302 -1.41669927 200 2 0.4814364147 -0.1087465 300 2 -1.52273626 0.6357750776
5 100 3 -0.81072008 0.64522238 300 2 -0.5676866296 0.2371230 600 2 -0.09687669 2.2883585934
6 200 3 0.45175343 0.64197622 200 3 0.0006852893 0.5830704 200 3 0.01726120 -0.5905109745
7 300 3 0.40465989 -0.70796588 300 3 -0.0008717189 -1.1969493 300 3 -0.18603781 0.3722390396
8 200 4 0.09852108 -1.76958443 400 3 0.9343534507 -1.3671447 100 4 -0.57308316 0.4749221706
9 300 4 -0.53951022 0.97306346 200 4 1.9176422485 0.9879788 200 4 0.40222133 0.3278821640
10 400 4 0.24271562 -1.37269617 300 4 1.4298971045 1.6095265 300 4 0.85799186 0.0006593401
My final output would look like this:
id t x1 x2
200 1 0.37223136 -0.04918183
300 1 -0.22062416 0.05150952
200 2 0.32912208 1.03922999
300 2 -0.18172302 -1.41669927
200 3 0.45175343 0.64197622
300 3 0.40465989 -0.70796588
200 4 0.09852108 -1.76958443
300 4 -0.53951022 0.97306346
200 1 0.6489171399 -0.1324335
300 1 -0.3669090613 3.0826144
200 2 0.4814364147 -0.1087465
300 2 -0.5676866296 0.2371230
200 3 0.0006852893 0.5830704
300 3 -0.0008717189 -1.1969493
200 4 1.9176422485 0.9879788
300 4 1.4298971045 1.6095265
200 1 -0.41387676 -0.4566678425
300 1 0.48237987 -0.0325861333
200 2 -0.31308242 -0.3021501845
300 2 -1.52273626 0.6357750776
200 3 0.01726120 -0.5905109745
300 3 -0.18603781 0.3722390396
200 4 0.40222133 0.3278821640
300 4 0.85799186 0.0006593401
One strategy is to count the number of times each combination of id and t appears. If this equals the maximum possible, then keep that id. (I used max to get the maximum possible number of combinations, but that only works if at least one id appears in every t.)
I use adply from the plyr package here to replace your rbind step, because adply preserves information about which data frame each row came from (in the X1 column).
library(plyr)
rb <- adply(list(df, df2, df3), 1)
unique_combo <- unique(rb[, c("X1", "id", "t")])
## X1 id t
## 1 1 200 1
## 2 1 300 1
## 3 1 200 2
## 4 1 300 2
## 5 1 100 3
## 6 1 200 3
## 7 1 300 3
## 8 1 200 4
## 9 1 300 4
## 10 1 400 4
## 11 2 200 1
## 12 2 300 1 etc.
combos_per_id <- aggregate(t ~ id, FUN = length, data = unique_combo)
## id t
## 1 100 2
## 2 200 12
## 3 300 12
## 4 400 3
## 5 600 1
ids_you_want <- subset(combos_per_id, t == max(t))
## id t
## 2 200 12
## 3 300 12
rb[rb$id %in% ids_you_want$id, ]
## X1 id t x1 x2
## 1 1 200 1 0.41800060 -0.729280896
## 2 1 300 1 -1.26310444 0.649438361
## 3 1 200 2 1.75130801 0.340464369
## 4 1 300 2 -0.47751518 -1.396611139
## 6 1 200 3 -0.11537438 -1.483654622
## 7 1 300 3 -1.33689508 -1.219725112 etc.
Edit to handle another column
library(plyr)
t<-c(1,1,2,2,3,3,3,4,4,4)
id<-c(200,300,200,300,100,200,300,200,300,400)
x1<-rnorm(1:10)
x2<-rnorm(1:10)
r<-c("b","a","a","a","a","a","a","a","a","a")
df<-data.frame(id,t,x1,x2, r)
t<-c(1,1,1,2,2,3,3,3,4,4)
id<-c(200,300,400,200,300,200,300,400,200,300)
x1<-rnorm(1:10)
x2<-rnorm(1:10)
r<-c("a","a","a","a","a","a","a","a","a","a")
df2<-data.frame(id,t,x1,x2, r)
id<-c(200,300,200,300,600,200,300,100,200,300)
t<-c(1,1,2,2,2,3,3,4,4,4)
x1<-rnorm(1:10)
x2<-rnorm(1:10)
r<-c("a","a","a","a","a","a","a","a","a","a")
df3<-data.frame(id,t,x1,x2, r)
rb <- adply(list(df, df2, df3), 1)
unique_combo <- unique(rb[, c("X1", "id", "t", "r")])
(combos_per_id <- aggregate(t ~ id + r, FUN = length, data = unique_combo))
ids_you_want <- subset(combos_per_id, t == max(t))
rb[rb$id %in% ids_you_want$id, ]
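The same counting idea can be sketched with dplyr, where bind_rows()'s .id column plays the role of adply's X1 (a sketch assuming the dplyr package; not part of the original answer):
library(dplyr)
rb <- bind_rows(list(df, df2, df3), .id = "X1")  # X1 records which data frame each row came from
keep_ids <- rb %>%
  distinct(X1, id, t) %>%   # one row per data frame / id / time period combination
  count(id) %>%             # number of combinations each id appears in
  filter(n == max(n)) %>%   # keep ids seen the maximum possible number of times
  pull(id)
rb %>% filter(id %in% keep_ids)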
This is a bit brute force, but should work:
frames <- list(df, df2, df3)
lu <- function(x) length(unique(x))
# number of distinct time periods in each data frame
timePeriodsPerDataframe <- sapply(frames, function(x) lu(x$t))
for (i in seq_along(frames)) {
  # for each id, TRUE if it appears in every time period of this data frame
  appearsInEveryTimePeriod <- tapply(frames[[i]]$t,
                                     frames[[i]]$id,
                                     lu) == timePeriodsPerDataframe[i]
  if (i == 1)
    IDsInEveryTimePeriod <- names(appearsInEveryTimePeriod[appearsInEveryTimePeriod])
  else
    IDsInEveryTimePeriod <- intersect(names(appearsInEveryTimePeriod[appearsInEveryTimePeriod]),
                                      IDsInEveryTimePeriod)
}
IDsInEveryTimePeriod <- as.numeric(IDsInEveryTimePeriod)
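With the surviving IDs you could then filter each data frame, for example (a sketch, not part of the original answer):
kept <- lapply(frames, function(x) x[x$id %in% IDsInEveryTimePeriod, ])
do.call(rbind, kept)  # stack the filtered data frames back together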

R add index column to data frame based on row values

This is a continuation of r - How to add row index to a data frame, based on combination of factors
I tried to replicate what I believe to be the desired results using the green checked answer and am consistently getting something other than expected. I am sure I am doing something really basic wrong, but can't seem to see it OR I've misunderstood what the desired state is.
The data from the original post:
temp <- data.frame(
  Dim1 = c("A","A","A","A","A","A","B","B"),
  Dim2 = c(100,100,100,100,200,200,100,200),
  Value = sample(1:10, 8)
)
Then I ran the following code:
temp$indexLength <- ave(1:nrow(temp), temp$Dim1, factor(temp$Dim2), FUN = function(x) 1:length(x))
and:
temp$indexSeqAlong <- ave(1:nrow(temp), temp$Dim1, factor(temp$Dim2), FUN = seq_along)
and then I created the following:
temp$indexDesired <- c(1, 1, 1, 1, 2, 2, 3, 3)
...ending up with the data frame below:
Dim1 Dim2 Value indexLength indexSeqAlong indexDesired
1 A 100 6 1 1 1
2 A 100 2 2 2 1
3 A 100 9 3 3 1
4 A 100 8 4 4 1
5 A 200 10 1 1 2
6 A 200 4 2 2 2
7 B 100 3 1 1 3
8 B 200 5 1 1 4
If I can figure out why I'm not getting the desired index -- and assuming the code is extensible to more than 2 variables -- I should be all set. Thanks in advance!
If you use data.table, there is a special symbol .GRP which records this information (a simple group counter):
library(data.table)
DT <- data.table(temp)
DT[, index := .GRP, by = list(Dim1, Dim2)]
DT
# Dim1 Dim2 Value index
# 1: A 100 10 1
# 2: A 100 2 1
# 3: A 100 9 1
# 4: A 100 4 1
# 5: A 200 6 2
# 6: A 200 1 2
# 7: B 100 8 3
# 8: B 200 7 4
Once the values in the first argument have been partitioned, there is no way for ave to "know" in what order they were passed. You want a method that can look at changes in values. The duplicated function is generic and has a data.frame method that looks at multiple columns:
temp$indexSeqAlong <- cumsum(!duplicated(temp[, 1:2]) )
temp
Dim1 Dim2 Value indexSeqAlong
1 A 100 8 1
2 A 100 2 1
3 A 100 7 1
4 A 100 3 1
5 A 200 5 2
6 A 200 1 2
7 B 100 4 3
8 B 200 10 4
It is extensible to as many columns as you want.
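For reference, a similar group counter can be sketched with dplyr's cur_group_id() (assumes dplyr >= 1.0; not part of the original answers):
library(dplyr)
temp %>%
  group_by(Dim1, Dim2) %>%
  mutate(index = cur_group_id()) %>%  # numbers the groups, much like data.table's .GRP
  ungroup()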
