Replace the subsetted and modified data back into the main dataframe - r

I have a data frame with 20 rows. I randomly select n rows and modify them. How can I put the modified values back into the original data frame so that only the modified values differ?
df<- data.frame(rnorm(n = 20, mean = 0, sd = 1))
n = 8
a<- data.frame(df[ c(1, sample(2:(nrow(df)-1), n), nrow(df) ), ])
a$changedvalue <- a[,1]*(2.5)
Now I want to replace the values of the original data frame df with the values of a$changedvalue, such that only the sampled values change while everything else in df stays the same. I tried something like this, but it's not working.
df %>% a[order(as.numeric(rownames(a))),]
I just want to point out that in my original dataset the data are timeseries data, so maybe they can be used for the purpose.

Instead of writing data.frame(df[ c(1, sample(2:(nrow(df)-1), n), nrow(df) ), ]) directly,
you can first define the rows you want to use. Let's call them rows:
rows <- c(1, sample(2:(nrow(df)-1), n), nrow(df) )
Now you can do
a<- data.frame(df[rows, ])
a$changedvalue <- a[,1]*(2.5)
df[rows, ] <- a$changedvalue
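As a rough check of the idea above, here is a minimal sketch that skips the intermediate data frame a and confirms that only the sampled rows change. The column name x, the seed, and the copy called original are assumptions added for illustration; they are not part of the question.

```r
set.seed(1)                                 # assumption: fixed seed so the run is repeatable
df <- data.frame(x = rnorm(n = 20, mean = 0, sd = 1))  # x is an illustrative column name
original <- df                              # keep a copy to compare against
n <- 8

# first row, n random interior rows, and the last row
rows <- c(1, sample(2:(nrow(df) - 1), n), nrow(df))

# modify the sampled rows in place; all other rows are left untouched
df[rows, 1] <- df[rows, 1] * 2.5

# row numbers whose value differs from the original
changed <- which(df$x != original$x)
```

Because the rows are addressed by position, the order in which sample() returned them does not matter, so no re-sorting by row name is needed.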

Related

How to get most frequent categorical value row wise from specific columns in R?

I have a data like time and date columns with respect to visit types.
Below is the sample data[1]
[1]: https://i.stack.imgur.com/GfnQb.png
From the above data I have to get maximum repeated value row wise.
I tried like below
out1$MAX1 <- do.call('pmax',c(out1[,2:5],list(na.rm=TRUE)))
Output of above code[2]
[2]: https://i.stack.imgur.com/CitAa.png
It gives wrong values for some of the rows. For example, in the output above the 3rd row contains "SFU", "SFU", "SFU", "SFU,GFU". With the code above, the maximum value comes out as "SFU,GFU", but it should be "SFU". I also have to add a new column showing how many times that visit type is repeated (i.e., for the same 3rd row "SFU" is the most frequent value and it is repeated 4 times).
How to achieve that?
Per your original question, you can get the most frequent categorical value row-wise by (1) creating a function that tables a vector and returns the most frequent value; and (2) looping through your data frame, unlisting each row and calling your function:
# Create sample data
df <- data.frame(
  id = 1:5,
  v1 = sample(c("SFU", "GFU"), replace = TRUE, 5),
  v2 = sample(c("SFU", "GFU"), replace = TRUE, 5),
  v3 = sample(c("SFU", "GFU"), replace = TRUE, 5),
  v4 = sample(c("SFU", "GFU"), replace = TRUE, 5),
  stringsAsFactors = FALSE
)
# Create function
get_most_frequent <- function(x) {
  tab <- sort(table(x))
  out <- names(tail(tab, 1))
  out
}
# Loop through data frame
df$most_frequent <- vector(mode = "character", length = nrow(df))
for (i in seq_len(nrow(df))) {
  r <- unlist(df[i, 2:5])
  df$most_frequent[i] <- get_most_frequent(r)
}
If you need to split up an instance like "SFU, GFU", you can adjust your function accordingly to split the strings by comma.
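One way that adjustment might look is sketched below. It splits comma-separated entries before tabulating and also returns how often the winning value occurs, which the question asked for. The helper name most_frequent_with_count, the column name MAX1_count, and the sample values in out1 are made up for illustration, since the screenshots are not available here.

```r
# Hypothetical helper: most frequent value in a row, plus its count
most_frequent_with_count <- function(x) {
  # split entries such as "SFU,GFU" into separate values and trim spaces
  parts <- trimws(unlist(strsplit(x[!is.na(x)], ",", fixed = TRUE)))
  tab <- table(parts)
  top <- names(tab)[which.max(tab)]   # on ties, the alphabetically first value wins
  list(value = top, count = as.integer(tab[top]))
}

# Illustrative data only; the third row mirrors the example from the question
out1 <- data.frame(
  id = 1:3,
  v1 = c("SFU", "GFU", "SFU"),
  v2 = c("SFU", "GFU", "SFU"),
  v3 = c("GFU", "SFU", "SFU"),
  v4 = c("SFU", "GFU", "SFU,GFU"),
  stringsAsFactors = FALSE
)

# apply the helper to columns 2 to 5 of every row
res <- lapply(seq_len(nrow(out1)), function(i) {
  most_frequent_with_count(unlist(out1[i, 2:5]))
})
out1$MAX1 <- vapply(res, function(r) r$value, character(1))
out1$MAX1_count <- vapply(res, function(r) r$count, integer(1))
```

For the third row this counts "SFU" four times, because the combined entry "SFU,GFU" contributes one "SFU" and one "GFU".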

Delete rows after a negative value in multiple data frames

I have multiple data frames which are individual sequences, all consisting of the same columns. I need to delete all the rows after a negative value is encountered in the column "OnsetTime". So not the row with the negative value itself, but the rows after it. All sequences have 16 rows in total.
I think this must be possible with a loop, but I have no experience with loops in R. I have 499 data frames, and I am currently deleting the rows of each sequence one by one, like this:
sequence_6 <- sequence_6[-c(11:16), ]
sequence_7 <- sequence_7[-c(11:16), ]
sequence_9 <- sequence_9[-c(6:16), ]
Is there a faster way of doing this? An example of a sequence can be seen here: example sequence
Regarding this example, I want to delete rows 7 to 16.
Data
Since the odd web configuration at work prevents me from accessing your data, I created three dataframes based on random numbers
set.seed(123); data_1 <- data.frame( value = runif(25, min = -0.1) )
set.seed(234); data_2 <- data.frame( value = runif(20, min = -0.1) )
set.seed(345); data_3 <- data.frame( value = runif(30, min = -0.1) )
First, you could create a list containing all your dataframes:
list_df <- list(data_1, data_2, data_3)
Now you can go through this list with a for loop. Since there are several steps, I find it convenient to use the package dplyr because it allows for a more readable notation:
library(dplyr)
for (i in seq_along(list_df)) {
  neg_rows <- list_df[[i]] %>%
    mutate(id = row_number()) %>%  # add a column with the row number
    filter(value < 0)              # keep the rows with negative values
  # if there is no negative value, leave the data frame unchanged
  if (nrow(neg_rows) > 0) {
    min_row <- min(neg_rows$id)    # row number of the first negative value
    list_df[[i]] <- list_df[[i]] %>% slice(1:min_row)  # keep rows 1 to min_row
  }
}
Hope it helps!
We can get the datasets into a list, assuming that the object names start with 'sequence' followed by a _ and one or more digits. Then use lapply to loop over the list and subset the rows based on the condition:
lst1 <- lapply(mget(ls(pattern="^sequence_\\d+$")), function(x) {
  i1 <- Reduce(`|`, lapply(x, `<`, 0))
  #or use rowSums
  #i1 <- rowSums(x < 0) > 0
  i2 <- which(i1)[1]  # first row with a negative value (NA if there is none)
  if (is.na(i2)) x else x[seq_len(i2), ]
})
data
set.seed(42)
sequence_6 <- as.data.frame(matrix(sample(-1:10, 16 *5, replace = TRUE), nrow = 16))
sequence_7 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))
sequence_9 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))
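Since the question refers specifically to the column "OnsetTime", a variant restricted to that one column might look like the sketch below. The helper name truncate_after_negative and the two small data frames are invented for illustration; with the real data the list could be built with mget() as shown above.

```r
# Hypothetical helper: keep rows up to and including the first negative value
truncate_after_negative <- function(x, col = "OnsetTime") {
  first_neg <- which(x[[col]] < 0)[1]   # NA when the column has no negative value
  if (is.na(first_neg)) {
    x                                   # nothing to cut
  } else {
    x[seq_len(first_neg), , drop = FALSE]
  }
}

# Illustrative sequences (not the asker's data)
sequence_6 <- data.frame(OnsetTime = c(5, 3, -1, 4, 2, -2))
sequence_7 <- data.frame(OnsetTime = c(1, 2, 3, 4, 5, 6))

seqs <- list(sequence_6 = sequence_6, sequence_7 = sequence_7)
trimmed <- lapply(seqs, truncate_after_negative)
```

Here trimmed$sequence_6 keeps its first three rows, while trimmed$sequence_7 is returned unchanged because it contains no negative value.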

How to find values less than -1 in each row for every 12 columns in R?

I have a matrix(100*120) and I am trying to find values <=-1 in each row for every 12 columns. I have tried several times but failed. It is easy to find values which are <= -1, but I do not know how to consider for every 12 columns and store the results for each row. Thanks for any help.
set.seed(100)
Mydata <- sample(x=-3:3,size = 100*120,replace = T)
Mydata <- matrix(data = Mydata,nrow = 100,ncol = 120)
results <- which(Mydata<=-1,arr.ind = T)
You can use the apply function to run which on one row at a time, which returns the columns in that row that meet the condition. If I misinterpreted what you wanted, you can adjust the MARGIN argument accordingly.
# MARGIN=1 to apply across rows
dd <- apply(Mydata,MARGIN=1,function(x) which(x <= -1))
dd[1] # which columns in row 1 have a value <= -1
You can do this using a combination of apply functions and seq()
#Example Data
set.seed(100)
Mydata <- sample(x=-3:3,size = 100*120,replace = T)
Mydata <- matrix(data = Mydata,nrow = 100,ncol = 120)
#Solution:
Myseq <- sapply(0:9,function(x) seq(1,12,1) + 12*x)
sapply(1:dim(Myseq)[2], function(x) which(Mydata[,Myseq[,x]] <= -1))
This results in a list with:
each element of the list representing one of your 10 groups of 12 columns
each value within an element giving the position (as a linear index into that 100 x 12 block of the matrix) of a value in those 12 columns that is <= -1.
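If the goal is one result per row for each group of 12 columns, a possible sketch is to label every column with its block number and count the matches per row within each block. The names block and counts are illustrative, and reading "store the results for each row" as a count is an assumption about what the question wants.

```r
set.seed(100)
Mydata <- matrix(sample(-3:3, size = 100 * 120, replace = TRUE),
                 nrow = 100, ncol = 120)

# block id (1 to 10) for each of the 120 columns, 12 columns per block
block <- rep(1:10, each = 12)

# counts[i, b] = number of values <= -1 in row i within block b
counts <- sapply(1:10, function(b) rowSums(Mydata[, block == b] <= -1))

dim(counts)  # 100 rows, 10 blocks
```

Replacing rowSums with apply(..., 1, function(r) which(r <= -1)) inside the same loop would return the matching column positions instead of the counts.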

Keeping column names when deleting columns

I've created an R script that calculates the percentage of missing values in each column of a data frame, and then removes the columns that exceed a preset threshold. The column names need to be maintained.
The names are maintained when there is more than one column in the data frame after column deletion, but not when there is only one column.
Code of when column names stay the same
df <- data.frame(A=rnorm(10, 10, 1), B=rep(NA, 10), C=rnorm(10, 10, 1))
threshold <- 80
pmiss <- function(x) {
  ifelse(sum(is.na(x))/length(x)*100 > threshold, TRUE, FALSE)
}
temp <- sapply(df, pmiss)
deletecols <- names(temp[temp==TRUE])
df <- as.data.frame(df[,!(names(df) %in% deletecols)])
names(df) #prints
[1] "A" "C"
However, define df as
df <- data.frame(A=rnorm(10, 10, 1), B=rep(NA, 10))
and
names(df) #prints
[1] "df[, !(names(df) %in% deletecols)]"
Does anybody know why the column names are not kept when there is only one column?
You've been bitten by an R FAQ. Add , drop = FALSE to your data frame subsetting (and you'll notice as a side effect that you no longer need as.data.frame).
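A minimal sketch of that fix applied to the two-column example from the question follows. The variable name too_many_missing is illustrative, and mean(is.na(x)) is used as a shorter equivalent of sum(is.na(x))/length(x).

```r
df <- data.frame(A = rnorm(10, 10, 1), B = rep(NA, 10))
threshold <- 80

# TRUE for columns whose percentage of missing values exceeds the threshold
too_many_missing <- sapply(df, function(x) mean(is.na(x)) * 100 > threshold)

# drop = FALSE keeps the result a data frame even when one column remains
df <- df[, !too_many_missing, drop = FALSE]

names(df)  # "A"
```

Without drop = FALSE, selecting a single column returns a plain vector, and wrapping that in as.data.frame() is what produced the odd column name in the question.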

Sorting and finding values in other data frames

I have a dataframe named commodities_3. It contains 28 columns with different commodities and 403 rows representing end-of-month data. What I need is to find the position for each row separately:
max value,
min value,
all other positives
all other negatives
Those indices should then be used to locate the corresponding data in another dataframe with the same column and row characteristics called commodities_3_returns. These data should then be copied into 4 new dataframes (one dataframe for each sorting).
I know how to find the positions of the values for each row using which and which.min and which.max. But I don't know how to put this in a loop in order to do it for all 403 rows. And subsequently how to use this data to locate the corresponding data in the other dataframe commodities_3_returns.
Unfortunately, I have to use a dataframe because I have dates as rownames in there, which I have to keep as I need them later for indexing, as well as NAs. It looks roughly like this:
commodities_3 <- as.data.frame(matrix(rnorm(15), nrow=5, ncol=3))
mydates <- as.Date(c("2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2011-01-05"))
rownames(commodities_3) <- mydates
commodities_3[3,2] <- NA
commodities_3_returns <- as.data.frame(matrix(rnorm(15), nrow=5, ncol=3))
mydates <- as.Date(c("2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2011-01-05"))
rownames(commodities_3_returns) <- mydates
commodities_3_returns[3,3] <- NA
As I said, I have in total 403 rows and 27 columns. In every row, there are some NA's which I have to keep as well. max.col doesn't seem to be able to handle NA's.
My desired output for the above-mentioned example would be something like this:
max_values <- as.data.frame(matrix(data=c(1:5,3,2,1,3,1), nrow=5, ncol=2, byrow=F))
If all the columns in commodities_3 are numeric, then you want a matrix, not a data frame. Then use the apply function. Some sample data, for reproducibility:
commodities_3 <- matrix(rnorm(12), nrow = 4)
commodities_3_returns <- matrix(1:12, nrow = 4)
The stats.
mins <- apply(commodities_3, 1, which.min)
maxs <- apply(commodities_3, 1, which.max)
pos <- apply(commodities_3, 1, function(x) which(x > 0)) #which is optional
neg <- apply(commodities_3, 1, function(x) which(x < 0))
Now use these in the index for commodities_3_returns. In the absence of coffee, my brain has only a clunky solution with a for loop
n_months <- nrow(commodities_3_returns)
min_returns <- numeric(n_months)
for(i in seq_len(n_months))
{
  min_returns[i] <- commodities_3_returns[i, mins[i]]
}
Here is an alternate approach to get the min and max using max.col which is a C function internally. If you have a large data set, max.col works extremely fast compared to apply based solutions
mins = max.col(-commodities_3)
maxs = max.col(commodities_3)
N = NROW(commodities_3)
commodities_3_returns[cbind(1:N, mins)] # returns min
commodities_3_returns[cbind(1:N, maxs)] # returns max
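Because the question mentions NA values and date row names, here is a hedged sketch that combines the ideas above: convert to a matrix (row names are carried over), find the per-row positions with which.max and which.min, which skip NAs, and look up the returns with matrix indexing. The seed and the output column names col and ret are assumptions, and the sketch presumes no row is entirely NA.

```r
set.seed(1)  # assumption: seed added for repeatability
dates <- as.character(as.Date("2011-01-01") + 0:4)

commodities_3 <- as.data.frame(matrix(rnorm(15), nrow = 5, ncol = 3))
rownames(commodities_3) <- dates
commodities_3[3, 2] <- NA

commodities_3_returns <- as.data.frame(matrix(rnorm(15), nrow = 5, ncol = 3))
rownames(commodities_3_returns) <- dates
commodities_3_returns[3, 3] <- NA

m <- as.matrix(commodities_3)          # date row names are preserved
r <- as.matrix(commodities_3_returns)

# which.max / which.min ignore NA values within each row
maxs <- unname(apply(m, 1, which.max))
mins <- unname(apply(m, 1, which.min))

idx <- seq_len(nrow(m))
# one row per date: column position of the extreme value and the matching return
max_values <- data.frame(col = maxs, ret = r[cbind(idx, maxs)], row.names = rownames(m))
min_values <- data.frame(col = mins, ret = r[cbind(idx, mins)], row.names = rownames(m))
```

The looked-up return can itself be NA when commodities_3_returns has a missing value at that position, which keeps the NAs as the question requires.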
