R: insert rows at specific places in dataframe - r

I can't seem to find an example to help me solve a particular problem in R. I have a data frame that looks like this:
tmp = data.frame(group = c(rep("A", 5), rep("B",2), rep("C",6)), value = rnorm(13))
In reality I have thousands of columns and rows with many different values for group. The rows in the data frame are ordered by group.
I'd like to insert a new row above the first occurrence of each group. I'd also like for these new rows to only contain a value (the same value) in the first column (although I can make do if columns 2:ncol(tmp) contain NAs). Using the example data frame above, the end result should look like this:
group value
GROUP
A -1.7596279
A -0.8273928
A -0.3515738
A -0.7547999
A 0.5700747
GROUP
B -1.9676482
B 0.3996858
GROUP
C 0.1047832
C 0.5903711
C -1.3687259
C 0.3688415
C 1.3674403
C 0.8880089
Is there a way to do this? I can come up with a list of rows containing the first instance of each group. I was originally thinking that I could use this information to define where new rows should be inserted, but I'm not sure if this is the best way to go.

I tried to create a function that does what you want it to do:
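addEmptyRows <- function(D)
{
  output <- D  # work on the argument, not on the global tmp
  output$group <- as.character(output$group)  # so "GROUP" is accepted even if group was a factor
  # Divider row: "GROUP" in the group column, NA in every other column
  divider <- output[1, ]
  divider[1, ] <- NA
  divider$group <- "GROUP"
  i <- 1
  while (i < NROW(output)) {
    if (output$group[i] != output$group[i+1])
    {
      # Group changes after row i: insert the divider and skip over it
      output <- rbind(output[1:i,], divider, output[(i+1):NROW(output),])
      i <- i+1
    }
    i <- i+1
  }
  return(rbind(divider, output))
}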
If you apply this function to your dataframe:
addEmptyRows(tmp)
It gives you the desired dataframe. Does this help you?

You could use something like this:
tmp <- data.frame(group = c(rep("A", 5), rep("B",2), rep("C",6)), value = rnorm(13))
divider <- data.frame(group = "GROUP", value = NA)
do.call(rbind, unlist(lapply(split(tmp, tmp$group),
                             function(x) list(divider, x)), recursive = FALSE))
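Since the question mentions many columns, here is a minimal sketch of the same split-and-rbind idea with a divider built from the data frame itself, so it does not need to list every column by name. The extra column `other` and the name `result` are made up for illustration, and `stringsAsFactors = FALSE` is assumed so that "GROUP" can be stored in the group column.

```r
set.seed(1)
tmp <- data.frame(group = c(rep("A", 5), rep("B", 2), rep("C", 6)),
                  value = rnorm(13),
                  other = runif(13),   # hypothetical extra column
                  stringsAsFactors = FALSE)

# Build a one-row divider with the same columns as tmp:
# NA everywhere, then "GROUP" in the first column.
divider <- tmp[1, ]
divider[1, ] <- NA
divider$group <- "GROUP"

# Put the divider on top of each group's block, then stack the blocks.
pieces <- lapply(split(tmp, tmp$group), function(x) rbind(divider, x))
result <- do.call(rbind, pieces)
rownames(result) <- NULL   # drop the row names generated by rbind
```

Note that split() orders the blocks by group name, which matches the question only because the rows are already ordered by group.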

Related

R, how to use for loop to make each element in the list executed by the same command

How can I use a for loop in R to run the same command on each element of a list?
For example, datanewV_102, datanewV_103 and datanewV_105 are all data frames with 100 observations and 50 columns.
I want to drop the same column, the 6th column, from all three data frames.
The desired result (no new data frames, just edits to the existing ones):
datanewV_102, datanewV_103 and datanewV_105 become data frames with 100 observations and 49 columns.
My code was like:
vlist12 = list(datanewV_102,datanewV_103,datanewV_105)
for (v12 in vlist12) {
v12 = v12[-6]
}
However, in my result datanewV_102, datanewV_103 and datanewV_105 remain the same. v12 ends up as a copy of datanewV_105 with the 6th column dropped.
How can I fix that, if I must use a for loop?
You need your output to also be a list, to be able to store all these separate data frames (in your current version, v12 gets overwritten at each iteration of the loop).
It is possible with a for loop, if you loop on the index. For example:
datanewV_102 <- data.frame(x = letters[1:10],
y = LETTERS[1:10])
datanewV_103 <- data.frame(x = letters[1:10],
y = LETTERS[1:10])
datanewV_105 <- data.frame(x = letters[1:10],
y = LETTERS[1:10])
datanew_list <- list(datanewV_102,datanewV_103,datanewV_105)
output_list1 <- list()
for(i in seq_along(datanew_list)){
cur_df <- datanew_list[[i]]
output_list1[[i]] <- cur_df[-2]
}
But this is typically a case where lapply makes things easier to read. For example:
remove_col <- function(df){
df[-2]
}
output_list2 <- lapply(datanew_list, remove_col)
The two solutions are equivalent:
all.equal(output_list1, output_list2)
#> [1] TRUE
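Both versions above return a new list and leave the original variables untouched. If the existing variables really have to be overwritten, as the question asks, one possible sketch is to loop over the variable names and use get() and assign(). The three-column toy data frames and the name `df_names` are assumptions for illustration.

```r
# Toy stand-ins for the real data frames (3 columns instead of 50)
datanewV_102 <- data.frame(x = 1:3, y = 4:6, z = 7:9)
datanewV_103 <- data.frame(x = 1:3, y = 4:6, z = 7:9)
datanewV_105 <- data.frame(x = 1:3, y = 4:6, z = 7:9)

# Loop over the names, not the objects, so each variable can be reassigned
df_names <- c("datanewV_102", "datanewV_103", "datanewV_105")
for (nm in df_names) {
  assign(nm, get(nm)[-2])   # drop the 2nd column and store under the same name
}
```

Keeping the data frames in a list, as in the answer, is usually easier to work with; this variant is only for the case where the separate variables must survive.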

How can I create a function to generate new variables based on values in different dataframe in R

I would like to create a function like this (obviously not proper code):
forEach ID in DATAFRAME1 look at each row with ID in DATAFRAME2 {
if DATAFRAME2$VARIABLE1 = something {
DATAFRAME1$VARIABLE1 = TRUE;
DATAFRAME1$VARIABLE2 = DATAFRAME2$VARIABLE2
}
}
In plain text, I've got a list of individuals and a database with mixed information on these individuals. Let's say DATAFRAME2 contains information on books read, c(id, title, author, date). I want to create a new variable in DATAFRAME1 with a boolean indicating whether the individual has read a specific book (VARIABLE1 above) and the date they first read it (VARIABLE2 above). Adding a third variable with the number of times read would also be interesting, but is not necessary.
I haven't really done this in R before; I mostly do basic statistics and basic wrangling with dplyr. I guess I could use dplyr and a join, but this feels like a better approach. Any help to get me started would be much appreciated.
The following function does what the question asks for. Its arguments are:
DF1 and DF2, the two data frames;
ID, the name of the id column (default 'ID');
var1 and var2, which are VARIABLE1 and VARIABLE2 in the question;
value, which is the value of something.
The test data is at the end.
fun <- function(DF1, DF2, ID = 'ID', var1, var2, value){
  DF1[[var1]] <- NA
  DF1[[var2]] <- NA
  # rows of DF2 where var1 equals the value of interest
  k <- DF2[[var1]] == value
  for(id in DF1[[ID]]){
    i <- DF1[[ID]] == id
    j <- DF2[[ID]] == id
    if(any(j & k)){
      DF1[[var1]][i] <- TRUE
      # copy var2 from the matching row of DF2
      DF1[[var2]][i] <- DF2[[var2]][j & k]
    }
  }
  DF1
}
fun(df1, df2, value = 4, var1 = 'X', var2 = 'Y')
# ID X Y
#1 a NA NA
#2 d TRUE 19
Test data.
set.seed(1234)
df1 <- data.frame(ID = c("a", "d"))
df2 <- data.frame(ID = rep(letters[1:5], 4),
X = sample(20, 20, TRUE),
Y = sample(20))
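For the books example described in the question, a loop-free sketch in base R could look like the following. The data, the column names (`id`, `title`, `date`) and the new variables (`has_read`, `first_read`, `times_read`) are all made up for illustration; they are not taken from the answer above.

```r
people <- data.frame(id = c(1, 2, 3))
reads <- data.frame(
  id    = c(1, 1, 2, 3, 1),
  title = c("Dune", "Emma", "Dune", "Emma", "Dune"),
  date  = as.Date(c("2020-01-05", "2020-02-01", "2021-03-10",
                    "2019-07-04", "2019-12-25")),
  stringsAsFactors = FALSE
)

target <- "Dune"

# Keep only the reads of the target book, earliest first
hits <- reads[reads$title == target, ]
hits <- hits[order(hits$date), ]

# match() returns the first matching row, which is the earliest read
first_idx <- match(people$id, hits$id)

people$has_read   <- !is.na(first_idx)
people$first_read <- hits$date[first_idx]
# Count reads per person; people with no reads get 0
people$times_read <- as.integer(table(factor(hits$id, levels = people$id)))
```

Sorting before match() is what makes "first match" mean "first read"; the count column covers the optional third variable from the question.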

How to extract rows of a data frame between two characters

I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values. The first column contains string identifiers, like a credit rating or a country symbol (for FX data), while the other 6 columns contain either a header, such as a percentage change string (e.g. +10%), or just a numerical value. Since I have all this data lumped together, I want to be able to extract the data for each category. So, for instance, I'd like to extract all the rows between my "credit" keyword and my "FX" keyword in my first column. Is there an easy way to do this in either base R or dplyr?
eg.
df %>%
filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1,2,3,4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1,2,3,4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
Using what Brian suggested:
in_between <- function(df, start, end){
return(df[start:(end-1),])
}
Then loop over the indices in
dividers <- which(df$column1 %in% keywords)
And save the function outputs however one would like.
lapply(seq_len(length(dividers) - 1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
This works. Messy data so I still have the annoying case where I need to keep the offset rows.
I'm still not 100% sure what you are trying to accomplish but does this do what you need it to?
library(dplyr)
set.seed(1)
df <- data.frame(
x = sample(LETTERS[1:10]),
y = rnorm(10),
z = runif(10)
)
start <- c("C", "E", "F")
df2 <- df %>%
mutate(start = x %in% start,
group = cumsum(start))
split(df2, df2$group)
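The same cumsum idea can be sketched in base R and applied to a keyword column like the one in the question. The toy data, the single value column `v1`, and the names `is_header`, `section` and `sections` are assumptions for illustration.

```r
df <- data.frame(
  column1 = c("random", "credit", "AAA", "BBB", "FX", "EUR", "JPY"),
  v1      = c("-1%", NA, "1", "5", "-1%", "1.1", "140"),
  stringsAsFactors = FALSE
)
keywords <- c("credit", "FX")

# TRUE on rows that start a new section
is_header <- df$column1 %in% keywords
# Running count of headers seen so far: 0 before the first keyword
section <- cumsum(is_header)

# Drop rows before the first keyword, then split by section number
keep <- section > 0
sections <- split(df[keep, ], section[keep])
names(sections) <- df$column1[is_header]
```

Each element of `sections` starts with its keyword row, so `sections$credit` holds the rows from "credit" up to, but not including, "FX". Unlike the start:(end-1) version, this also works when a keyword appears more than once, although the list names would then repeat.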

Replace value from dataframe column with value from keyvalue lookup

I want to replace certain values in a data frame column with values from a lookup table. I have the values in a list, stuff.kv, and many values are stored in the list (but some may not be).
stuff.kv <- list()
stuff.kv[["one"]] <- "thing"
stuff.kv[["two"]] <- "another"
#etc
I have a dataframe, df, which has multiple columns (say 20), with assorted names. I want to replace the contents of the column named 'stuff' with values from 'lookup'.
I have tried building various apply methods, but nothing has worked.
I built a function which processes a list of items and returns the mutated list,
stuff.lookup <- function(x) {
for( n in 1:length(x) ) {
if( !is.null( stuff.kv[[x[n]]] ) ) x[n] <- stuff.kv[[x[n]]]
}
return( x )
}
unlist(lapply(df$stuff, stuff.lookup))
The apply syntax is bedeviling me.
Since you made such a nice lookup table, you can just use it to change the values. No loops or apply needed.
## Sample Data
set.seed(1234)
DF = data.frame(stuff = sample(c("one", "two"), 8, replace=TRUE))
## Make the change
DF$stuff = unlist(stuff.kv[as.character(DF$stuff)])  # as.character() guards against factor columns being used as integer indices
DF
stuff
1 thing
2 another
3 another
4 another
5 another
6 another
7 thing
8 thing
Below is a more general solution building on @G5W's answer, which doesn't cover the case where the original data frame has values that don't exist in the lookup table (that case results in a length-mismatch error):
library(dplyr)
stuff.kv <- list(one = "another", two = "thing")
df <- tibble(
  stuff = rep(c("one", "two", "three"), each = 3)
)
# paste() coerces the list to character; keys missing from stuff.kv become the string "NULL"
df <- df %>%
  mutate(stuff = paste(stuff.kv[stuff]))
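If values without a lookup entry should be kept as they are, one possible sketch is to store the lookup as a named character vector instead of a list. Indexing a named vector by a missing name gives NA, which can then be replaced with the original value. The names `looked_up` and `result` are made up for illustration.

```r
# Lookup table as a named character vector (same pairs as the question)
stuff.kv <- c(one = "thing", two = "another")

stuff <- c("one", "two", "three", "one")

# Names not present in the lookup come back as NA
looked_up <- unname(stuff.kv[stuff])

# Fall back to the original value where there was no match
result <- ifelse(is.na(looked_up), stuff, looked_up)
```

With a data frame, the same two lines would be applied to the column, e.g. to `df$stuff`.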

R loops: Adding a column to a table if does not already exist

I am trying to compile data from several files using for loops in R. I would like to get all the data into one table. The following calculation is just an example.
library(reshape)
dat1 <- data.frame("Specimen" = paste("sp", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2), "Density_3" = rnorm(10,4,2))
dat2 <- data.frame("Specimen" = paste("fg", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2))
dat <- c("dat1", "dat2")
for(i in 1:length(dat)){
data <- get(dat[i])
melt.data <- melt(data, id = 1)
assign(paste(dat[i], "tbl", sep=""), cast(melt.data, ~ variable, mean))
}
rbind(dat1tbl, dat2tbl)
What is the smoothest way to add an extra column into dat2? I would like to get the same column name ("Density_3" in this case) and fill it up with zeros, if it does not already exist. Assume that I have ~100 tables with number of columns (Density_1, 2, 3 etc) varying between 5 and 6.
I tried the following, but it didn't work:
if(names(data) %in% "Density_3" == FALSE){
dat.all$Density_3 <- 0
} else {
dat.all$Density_3 <- dat.all$Density3}
Another one: is there a smooth way to rbind() the tables? It seems that rbind(get(dat)) does not work.
After staring at this question for a while, I think its intent may have been obscured by the unnecessary get and assign manipulations. And I think the answer is plyr::rbind.fill.
I would have constructed "dat" not as a character vector but as a list of two data frames, used aggregate(..., FUN=mean) (because I haven't gotten on the reshape2/plyr bus, except for melt and rbind.fill, that is), and then called do.call(rbind.fill, ...) on the resulting list. At any rate, this is what I think you want. I do not think it is a good idea to add in zeros for what are really missing values.
> rbind.fill(dat1tbl, dat2tbl)
value Density_1 Density_2 Density_3
1 (all) 5.006709 4.088988 2.958971
2 (all) 4.178586 3.812362 NA
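The list-based approach described above can be sketched without get and assign. To keep the example free of package dependencies, this version uses colMeans instead of aggregate and fills the missing columns by hand in base R, which is roughly the step plyr::rbind.fill would do for you. The names `tables`, `means`, `all_cols`, `filled` and `combined` are made up for illustration.

```r
set.seed(42)
dat1 <- data.frame(Specimen = paste0("sp", 1:10),
                   Density_1 = rnorm(10, 4, 2),
                   Density_2 = rnorm(10, 4, 2),
                   Density_3 = rnorm(10, 4, 2))
dat2 <- data.frame(Specimen = paste0("fg", 1:10),
                   Density_1 = rnorm(10, 4, 2),
                   Density_2 = rnorm(10, 4, 2))

# Keep the tables in a named list instead of a vector of variable names
tables <- list(dat1 = dat1, dat2 = dat2)

# One-row data frame of column means per table (first column is the id)
means <- lapply(tables, function(d) as.data.frame(t(colMeans(d[-1]))))

# Add any column a table lacks, filled with NA, and align the column order
all_cols <- unique(unlist(lapply(means, names)))
filled <- lapply(means, function(m) {
  for (col in setdiff(all_cols, names(m))) m[[col]] <- NA_real_
  m[all_cols]
})

combined <- do.call(rbind, filled)
```

The missing Density_3 for dat2 is left as NA rather than 0, following the answer's point that these are missing values.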

Resources