R Creating Dynamic variables from group aggregated set of DataFrames - r

My problem statement is I have a list of dataframes as df1,df2,df3.Data is like
df1
a,b,c,d
1,2,3,4
1,2,3,4
df2
a,b,c,d
1,2,3,4
1,2,3,4
Now, for these two dataframe I should create a new dataframe taking aggregated column of those two dataframes ,for that I am using below code
for(i in 1:2){
assign(paste(final_val,i,sep=''),sum(assign(paste(df,i,sep='')))$d*100)}
I am getting the error:
Error in assign(paste(hvp_route_dsct_clust, i, sep = "")) :
argument "value" is missing, with no default
My output should look like
final_val1 <- 800
final_val2 <- 800
And for those values final_val1,final_val2 I should be creating dataframe dynamicaly
Can anybody please help me on this

If we need to use assign, get the object names from the global environment with ls by specifying the pattern 'df' followed by one or more numbers (\\d+), create another vector of 'final_val's ('nm1'), loop through the sequence of 'nm1', assign each of the element in 'nm2' to the value we got from extracting the column 'd' of each 'df's multiplied by 100 and taking its sum.
nm1 <- ls(pattern = "df\\d+")
nm2 <- paste0("final_val", seq_along(nm1))
for(i in seq_along(nm1)){
assign(nm2[i], sum(get(nm1[i])$d*100))
}
final_val1
#[1] 800
final_val2
#[1] 800
Otherwise, we place the datasets in a list, extract the 'd' column, multiply with 100 and do the column sums
unname(colSums(sapply(mget(nm1), `[[`, 'd') * 100))
#800 800
data
df1 <- structure(list(a = c(1L, 1L), b = c(2L, 2L), c = c(3L, 3L), d = c(4L,
4L)), .Names = c("a", "b", "c", "d"), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(a = c(1L, 1L), b = c(2L, 2L), c = c(3L, 3L), d = c(4L,
4L)), .Names = c("a", "b", "c", "d"), class = "data.frame", row.names = c(NA,
-2L))

Related

Check if value from one data frame can be found in another data frame in R

I have two data.frames as follows:
a$id <- as.data.frame(c("1-23-2", "2-3-231-2", "122-121"))
b$id <- as.data.frame(c("1-23-2", "122-121", "12-1223-12", "1221-12"))
I want to check, if all values of a can be found in b.
I tried this:
if (a$id %in% b$id){a$test <- "yes"} else {a$test <- "no"}
Which gives a warning message and the wrong result unfortunately.
Use ifelse.
a$test <- ifelse(a$id %in% b$id, "yeah", "no")
a
# id test
# 1 1-23-2 yeah
# 2 2-3-231-2 no
# 3 122-121 yeah
Data
a <- structure(list(id = structure(c(1L, 3L, 2L), .Label = c("1-23-2",
"122-121", "2-3-231-2"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
b <- structure(list(id = structure(c(1L, 3L, 2L, 4L), .Label = c("1-23-2",
"12-1223-12", "122-121", "1221-12"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
You may have several base R approaches to make it, e.g.,
a <- within(a,test <- ifelse(id %in% b$id,"yes","no"))
or
a <- within(a,test <- c("yes","no")[(!id%in% b$id) + 1])
or
a <- within(a,test <- c("yes","no")[is.na(match(id,b$id))+1])
such that
> a
id test
1 1-23-2 yes
2 2-3-231-2 no
3 122-121 yes
DATA
a <- data.frame(id = c("1-23-2", "2-3-231-2", "122-121"))
b <- data.frame(id = c("1-23-2", "122-121", "12-1223-12", "1221-12"))

The first two columns defined as "rownames"

I want to define the first two columns of a data frame as rownames. Actually I want to do some calculations and the data frame has to be numeric for that.
data.frame <- data_frame(id=c("A1","B2"),name=c("julia","daniel"),BMI=c("20","49"))
The values for BMI are numerical (proved with is.numeric), but the over all data.frame not. How to define the first two columns (id and name) as rownames?
Thank you in advance for any suggestions
You can combine id and name column and then assign rownames
data.frame %>%
tidyr::unite(rowname, id, name) %>%
tibble::column_to_rownames()
# BMI
#A1_julia 20
#B2_daniel 49
In base R, you can do the same in steps as
data.frame <- as.data.frame(data.frame)
rownames(data.frame) <- paste(data.frame$id, data.frame$name, sep = "_")
data.frame[c('id', 'name')] <- NULL
Not sure if the code and result below is the thing you are after:
dfout <- `rownames<-`(data.frame(BMI = as.numeric(df$BMI)),paste(df$id,df$name))
such that
> dfout
BMI
A1 julia 20
B2 daniel 49
DATA
df <- structure(list(id = structure(1:2, .Label = c("A1", "B2"), class = "factor"),
name = structure(2:1, .Label = c("daniel", "julia"), class = "factor"),
BMI = structure(1:2, .Label = c("20", "49"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))

Get single column of values comparing multiple columns

I have just started my journey with R. I want to test values across multiple columns for the same condition and return 5 if any of the values is "hello" within a row:
result = ifelse((myData[1] == "hello") | (myData[2] == "hello") | (myData[3] == "hello"), 5, 0)
This works fine, but code seems to be redundant. When I do:
resultSec = ifelse(myData[1:3] == "hello", 5, 0)
Then all 3 columns are checked against the condition, but the result I get is not a single column, but 3 columns. So then I would have to perform an additional comparison for all columns which makes totally more lines of code then the first redundant method.
How can I get in this case a one column of values in efficient way ?
You can use the function apply() to iterate over a data.frame or matrix, by either columns or rows. The margin argument determines which one you use.
Here we want to check the rows, so we use margin = 1:
dat <- data.frame(col1 = c("happy", "sad", "mad"),
col2 = c("tired", "sleepy", "happy"),
col3 = c("relaxed", "focused", "fine"))
dat$res <- apply(X = dat, MARGIN = 1,
FUN = function(x) ifelse("happy" %in% x, 5, 0))
dat
col1 col2 col3 res
1 happy tired relaxed 5
2 sad sleepy focused 0
3 mad happy fine 5
We can use rowSums here
df1$res <- rowSums(df1 == "happy") * 5
df1$res
#[1] 5 0 5
data
df1 <- structure(list(col1 = structure(c(1L, 3L, 2L), .Label = c("happy",
"mad", "sad"), class = "factor"), col2 = structure(c(3L, 2L,
1L), .Label = c("happy", "sleepy", "tired"), class = "factor"),
col3 = structure(c(3L, 2L, 1L), .Label = c("fine", "focused",
"relaxed"), class = "factor")), .Names = c("col1", "col2",
"col3"), row.names = c(NA, -3L), class = "data.frame")

How to I edit a large number of row names in R?

I have 100's of rows I want to edit so I'd rather not do it "manually" via these scripts:
a <-data.frame(name=c("A","B","C","D", b=1:4)
rownames(df) <- a$name
All rows have the same signifier I want to remove, ".meio", such that the rownames are currently:
A.meio, B.meio, C.meio, D.meio ...
I would like the row names to be
A, B, C, D, etc.
How can I do this efficiently?
Thank you.
You can use gsub the function.
Supposedly it works like...
> a <- structure(list(name = structure(1:4, .Label = c("A", "B",
+ "C",
+ "D"), class = "factor"), b = 1:4), .Names = c("name", "b"),
+ row.names = c("A.meio",
+ "B.meio", "C.meio", "D.meio"), class = "data.frame")
> a
name b
A.meio A 1
B.meio B 2
C.meio C 3
D.meio D 4
> row.names(a)=gsub(".meio","",row.names(a))
> a
name b
A A 1
B B 2
C C 3
D D 4
The difference is that sub only replaces the first occurrence of the pattern specified, whereas gsub does it for all occurrences (that is, it replaces globally).
We could use sub to match the pattern . followed by one or more characters (.*) to the end of the string ($) and replace it with ''.
row.names(a) <- sub("\\..*$", '', row.names(a))
NOTE: From the example showed by the OP, it seems that there is only a single instance of .meio, so sub is sufficient.
data
a <- structure(list(name = structure(1:4, .Label = c("A", "B",
"C",
"D"), class = "factor"), b = 1:4), .Names = c("name", "b"),
row.names = c("A.meio",
"B.meio", "C.meio", "D.meio"), class = "data.frame")

Creating new dataframe using weighted averages from dataframes within list

I have many dataframes stored in a list, and I want to create weighted averages from these and store the results in a new dataframe. For example, with the list:
dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"),
df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")),
.Names = c("df1", "df2"))
In this example, I want to use columns A, B, and Weight for the weighted averages. I also want to move over related data such as Site, and want to sum the number of TRUE and FALSE. My desired result would look something like:
result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"),
A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L,
1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))
Site A.Weight B.Weight Sum.Weight
1 X 4.5 6 2
2 Y 8.0 4 1
The above is just a very simple example, but my real data have many dataframes in the list, and many more columns than just A and B for which I want to calculate weighted averages. I also have several columns similar to Site that are constant in each dataframe and that I want to move to the result.
I'm able to manually calculate weighted averages using something like
weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)
but I'm not sure how I can do this in a shorter, less "manual" way. Does anyone have any recommendations? I've recently learned how to lapply across dataframes in a list, but my attempts have not been so great so far.
The trick is to create a function that works for a single data.frame, then use lapply to iterate across your list. Since lapply returns a list, we'll then use do.call to rbind the resulting objects together:
foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
means <- t(sapply(data[, meanCols], weighted.mean, w = data[, weightCol]))
sumWeight <- sum(data[, weightCol])
others <- data[1, otherCols, drop = FALSE] #You said all the other data was constant, so we can just grab first row
out <- data.frame(others, means, sumWeight)
return(out)
}
In action:
do.call(rbind, lapply(dfs, foo))
---
Site A B sumWeight
df1 X 4.5 6 2
df2 Y 8.0 4 1
Since you said this was a minimal example, here's one approach to expanding this to other columns. We'll use grepl() and use regular expressions to identify the right columns. Alternatively, you could write them all out in a vector. Something like this:
do.call(rbind, lapply(dfs, foo,
meanCols = grepl("A|B", names(dfs[[1]])),
otherCols = grepl("Site", names(dfs[[1]]))
))
using dplyr
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
unnest(dfs) %>%
group_by(Site) %>%
filter(Weight) %>%
mutate(Sum=n()) %>%
select(-Weight) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)))
gives the result
# Site A B Sum
#1 X 4.5 6 2
#2 Y 8.0 4 1
Or using data.table
library(data.table)
DT <- rbindlist(dfs)
DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE),
Sum=.N), by = Site, .SDcols = c("A", "B")]
# Site A B Sum
#1: X 4.5 6 2
#2: Y 8.0 4 1
Update
In response to #jazzuro's comment, Using dplyr 0.3, I am getting
unnest(dfs) %>%
group_by(Site) %>%
summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight)
# Site A_weighted.mean B_weighted.mean Sum.Weight
#1 X 4.5 6 2
#2 Y 8.0 4 1

Resources