So I have a small problem in R. I have multiple data sets (data0, data1,...) and I want to do the following:
data01 <- data0[1:6,]
data02 <- data0[7:12,]
data11 <- data1[1:6,]
data12 <- data1[7:12,]
data21 <- data2[1:6,]
data22 <- data2[7:12,]
data31 <- data3[1:6,]
data32 <- data3[7:12,]
...etc
I would like to do this in a for loop like so:
for(i in 1:(some high number)){
datai1 <- datai[1:6,]
datai2 <- datai[7:12,]
}
I've tried messing around with assign() and get(), however I cannot make it work. I found something that might work in this question, however the difference is that here the variable d should also change depending on the index. Any idea how I could make this work?
Here is a more R-like approach than using assign:
data1 <- data0 <- data.frame(x = 1:12, y = letters[1:12]) #some data
mylist <- mget(ls(pattern = "data\\d")) #collect free floating objects into list
#it would be better to put the data.frames into a list when you create them
res <- lapply(mylist, function(d) split(d[1:12,], rep(1:2, each = 6))) #loop over list and split each data.frame
The result is a nested list and it's easy to extract its elements:
res[["data1"]][["2"]]
#    x y
#7   7 g
#8   8 h
#9   9 i
#10 10 j
#11 11 k
#12 12 l
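The comment in the code above suggests putting the data frames into a list when they are created. A minimal sketch of that idea, assuming two made-up data frames standing in for data0 and data1 (the names data_list and halves are illustrative, not from the question):

```r
# Hypothetical stand-ins for the question's data0, data1, ...
data_list <- list(
  data0 = data.frame(x = 1:12, y = letters[1:12]),
  data1 = data.frame(x = 13:24, y = LETTERS[1:12])
)

# For every data frame, keep rows 1-6 and rows 7-12 as two named pieces
halves <- lapply(data_list, function(d) {
  list(first = d[1:6, ], second = d[7:12, ])
})

# Pieces are reached by name instead of by a constructed variable name
halves[["data1"]][["second"]]
```

With this layout there is nothing to collect with mget() afterwards, and adding another data set only means adding another list element.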
Assemble the variable names with paste() and then use get() and assign() as you suggest.
for (i in 1:10) {
datai <- get(paste('data', i, sep = ''))
assign(paste('data', i, '1', sep = ''), datai[1:6,])
assign(paste('data', i, '2', sep = ''), datai[7:12,])
}
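Note that the objects in the question start at data0, so the loop range has to match the indices that actually exist. A small self-contained sketch of the same get()/assign() idea, assuming two toy data frames:

```r
# Toy stand-ins (assumed) for the question's data0 and data1
data0 <- data.frame(x = 1:12, y = letters[1:12])
data1 <- data.frame(x = 13:24, y = LETTERS[1:12])

for (i in 0:1) {                               # only indices that exist
  d <- get(paste0("data", i))                  # look up data0, data1 by name
  assign(paste0("data", i, "1"), d[1:6, ])     # creates data01, data11
  assign(paste0("data", i, "2"), d[7:12, ])    # creates data02, data12
}

nrow(data02)
```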
I have trouble using the grep function within a for loop.
In my data set, I have several columns where only the last 5-6 letters change. With the loop I want to use the same functions for all 16 situations.
Here is my code:
situations <- c("KKKTS", "KKKNL", "KKDTS", "KKDNL", "NkKKTS", "NkKKNL", "NkKDTS", "NkKDNL", "KTKTS", "KTKNL", "KTDTS", "KTDNL", "NkTKTS", "NkTKNL", "NkTDTS", "NkTDNL")
View(situations)
for (i in situations[1:16]) {
## Trust Skala
a <- vector("numeric", length = 1L)
b <- vector("numeric", length = 1L)
a <- grep("Tru_1_[i]", colnames(cleandata))
b <- grep("Tru_5_[i]", colnames(cleandata))
cleandata[, c(a:b)] <- 8-cleandata[, c(a:b)]
attach(cleandata)
cleandata$scale_tru_[i] <- (Tru_1_[i] + Tru_2_[i] + Tru_3_[i] + Tru_4_[i] + Tru_5_[i])/5
detach(cleandata)
}
With the grep function I first want to find the column numbers of e.g. Tru_1_KKKTS and Tru_5_KKKTS. Then I want to reverse-code the items in those columns. The last part worked without the loop when I manually used grep for every single situation.
Here is the manual version:
# KKKTS
grep("Tru_1_KKKTS", colnames(cleandata)) #29 -> find the index of respective column
grep("Tru_5_KKKTS", colnames(cleandata)) #33
cleandata[,c(29:33)] <- 8-cleandata[c(29:33)] # trust scale ranges from 1 to 7 [8-1/2/3/4/5/6/7 = 7/6/5/4/3/2/1]
attach(cleandata)
cleandata$scale_tru_KKKTS <- (Tru_1_KKKTS + Tru_2_KKKTS + Tru_3_KKKTS + Tru_4_KKKTS + Tru_5_KKKTS)/5
detach(cleandata)
You can do:
Mean5 <- function(sit) {
cnames <- paste0("Tru_", 1:5, "_", sit)
rowMeans(cleandata[cnames])
}
cleandata[, paste0("scale_tru_", situations)] <- sapply(situations, FUN=Mean5)
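As written, Mean5 averages the items without the 8 - x reverse coding from the question. A possible variant that reverse-codes before averaging, sketched on a shortened, made-up cleandata (the helper name ReverseMean5 and the toy values are assumptions; the item columns themselves are left unchanged here):

```r
# Shortened, made-up example: two situations, three respondents
situations <- c("KKKTS", "KKKNL")
item_names <- as.vector(outer(paste0("Tru_", 1:5, "_"), situations, paste0))
cleandata <- as.data.frame(matrix(rep(1:7, length.out = 30), nrow = 3))
names(cleandata) <- item_names

# Reverse-code on the 1-7 scale (8 - x), then average the five items per row
ReverseMean5 <- function(sit, data) {
  cnames <- paste0("Tru_", 1:5, "_", sit)
  rowMeans(8 - data[cnames])
}

# One column of scale values per situation, added under scale_tru_<situation>
scales <- sapply(situations, ReverseMean5, data = cleandata)
cleandata[paste0("scale_tru_", situations)] <- scales
```

Passing the data frame as an argument instead of reading cleandata from the global environment makes the helper easier to test on a small example like this one.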
How about something like this? It's a bit more compact and you don't have to use attach.
situations <- c("KKKTS", "KKKNL", "KKDTS", "KKDNL", "NkKKTS", "NkKKNL", "NkKDTS", "NkKDNL", "KTKTS", "KTKNL", "KTDTS", "KTDNL", "NkTKTS", "NkTKNL", "NkTDTS", "NkTDNL")
for (i in situations[1:16]) {
cols <- paste("Tru", 1:5, i, sep = "_")
result <- paste("scale_tru" , i, sep = "_")
cleandata[cols] <- 8 - cleandata[cols]
cleandata[result] <- rowMeans(cleandata[cols])
}
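For context on why the original grep("Tru_1_[i]", ...) call could not work: R does not substitute the loop variable into a string, and in a regular expression [i] is a character class that matches a literal "i". The result is an empty index, and a:b then fails. A short sketch with made-up column names:

```r
cols <- c("Tru_1_KKKTS", "Tru_5_KKKTS", "Tru_1_KKKNL")  # made-up column names
i <- "KKKTS"

# "[i]" matches a literal "i"; the value of the loop variable is not inserted
no_hit <- grep("Tru_1_[i]", cols)

# Building the name with paste0() inserts the value of i
hit <- grep(paste0("Tru_1_", i), cols)

# For exact names, match() avoids regex partial matches altogether
exact <- match(paste0("Tru_5_", i), cols)
```

Indexing by the constructed names directly, as in the loop above, sidesteps the position lookup entirely.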
I took for granted that when you write a:b you mean all the columns between those two, which I assumed are the items numbered 2 to 4.
situations <- c("KKKTS", "KKKNL", "KKDTS", "KKDNL", "NkKKTS", "NkKKNL", "NkKDTS", "NkKDNL", "KTKTS", "KTKNL", "KTDTS", "KTDNL", "NkTKTS", "NkTKNL", "NkTDTS", "NkTDNL")
# constructor for column names
get_col_names <- function(part) paste("Tru", 1:5, part, sep="_")
for (situation in situations) {
# revert the values in the columns in situ
cleandata[, get_col_names(situation)] <- 8 - cleandata[, get_col_names(situation)]
# and calculate the average
subdf <- cleandata[, get_col_names(situation)]
cleandata[, paste0("scale_tru_", situation)] <- rowSums(subdf)/ncol(subdf)
}
By the way, you call it "scale" but your code shows an average/mean calculation.
(Scale without centering).
My first post here so please tell me if I'm missing any important information.
I am handling a lot of data in the form of time (1:30 = row ID) vs. value, all stored in a number of data frames, and I need to keep each one as a data.frame.
I wrote a function that gets dataframes from my global environment and sorts the columns in each set into new data frames depending on their values.
So I start with a list of names of my data frames as input for my function and then end with assigning the created new dataframes to my global environment while using the assign function.
The data frames I get are all 30 rows long, but have different numbers of columns depending on how often a case appears in a data set. The name of each data frame represents one data set and the column names inside represent one timeline each. I use data frames so that I don't lose the column names.
This works when a case appears 0 times or more than once.
But if a data.frame ends up with only one column and I use the assign function, it appears as a vector in my global environment instead of a data frame. Therefore I lose the name of the column, and my other functions that only accept data frames stop at such a case and throw errors.
Here is a basic example of my problem:
#create two datasets with different cases
data1 <- data.frame(matrix(nrow=30, ncol=5))
data1[1] <- c(rep(1,each=30))
data1[2] <- c(rep(5, each=30))
data1[3] <- c(rep(5, each=30))
data1[4] <- c(rep(10, each=30))
data1[5] <- c(rep(10, each=30))
data2 <- data.frame(matrix(nrow=30, ncol=6))
data2[1] <- c(rep(5,each=30))
data2[2] <- c(rep(1, each=30))
data2[3] <- c(rep(1, each=30))
data2[4] <- c(rep(0, each=30))
data2[5] <- c(rep(0, each=30))
data2[6] <- c(rep(10, each=30))
#create list with names of datasets
names <- c('data1','data2')
#function for sorting
examplefunction <- function(VarNames) {
for (i in 1:length(VarNames)) {
#get current dataset
name <- VarNames[i]
data <- get(VarNames[i])
#create new empty data.frames for sorting
data.0 <- data.frame(matrix(nrow=30))
name.data.0 <- paste(name,"0", sep=".")
c.0 = 2 #start at second column, since first doesn't like the colname later
data.1 <- data.frame(matrix(nrow=30))
name.data.1 <- paste(name,"1", sep=".")
c.1 = 2
data.5 <- data.frame(matrix(nrow=30))
name.data.5 <- paste(name,"5", sep=".")
c.5 = 2
data.10 <- data.frame(matrix(nrow=30))
name.data.10 <- paste(name,"10", sep=".")
c.10 = 2
#sort data into new different data.frames
for (c in 1:ncol(data)) {
if(data[1,c]==0) {
data.0[c.0] = data[c]
c.0 = c.0 +1
}
else if(data[1,c]==1) {
data.1[c.1] = data[c]
c.1 = c.1 +1
}
else if(data[1,c]==5) {
data.5[c.5] = data[c]
c.5 = c.5 +1
}
else if(data[1,c]==10) {
data.10[c.10] = data[c]
c.10 = c.10 +1
}
else (stop="new values")
}
#remove first column with weird name
data.0 <- data.0[,-1]
data.1 <- data.1[,-1]
data.5 <- data.5[,-1]
data.10 <- data.10[,-1]
#assign data frames to global environment
assign(name.data.0, data.0, envir = .GlobalEnv)
assign(name.data.1, data.1, envir = .GlobalEnv)
assign(name.data.5, data.5, envir = .GlobalEnv)
assign(name.data.10, data.10, envir = .GlobalEnv)
}
}
#function call
examplefunction(names)
As explained before, if you run this you will end up with data frames of 0 variables and of more than 1 variable, plus three vectors where the data frame had only one column.
So my questions are:
1. Is there any way to keep the data type and force R to assign it as a data frame instead of a vector?
2. Or is there an alternative function I could use instead of assign()? If I use <<- how can I do the name assigning as above?
You can use drop = FALSE when subsetting:
examplefunction <- function(VarNames) {
for (i in 1:length(VarNames)) {
#get current dataset
name <- VarNames[i]
data <- get(VarNames[i])
#create new empty data.frames for sorting
data.0 <- data.frame(matrix(nrow=30))
name.data.0 <- paste(name,"0", sep=".")
c.0 = 2 #start at second column, since first doesn't like the colname later
data.1 <- data.frame(matrix(nrow=30))
name.data.1 <- paste(name,"1", sep=".")
c.1 = 2
data.5 <- data.frame(matrix(nrow=30))
name.data.5 <- paste(name,"5", sep=".")
c.5 = 2
data.10 <- data.frame(matrix(nrow=30))
name.data.10 <- paste(name,"10", sep=".")
c.10 = 2
#sort data into new different data.frames
for (c in 1:ncol(data)) {
if(data[1,c]==0) {
data.0[c.0] = data[c]
c.0 = c.0 +1
}
else if(data[1,c]==1) {
data.1[c.1] = data[c]
c.1 = c.1 +1
}
else if(data[1,c]==5) {
data.5[c.5] = data[c]
c.5 = c.5 +1
}
else if(data[1,c]==10) {
data.10[c.10] = data[c]
c.10 = c.10 +1
}
else stop("new values") #raise an error for values other than 0, 1, 5, 10
}
#remove first column with weird name
data.0 <- data.0[ , -1, drop = FALSE]
data.1 <- data.1[ , -1, drop = FALSE]
data.5 <- data.5[ , -1, drop = FALSE]
data.10 <- data.10[ , -1, drop = FALSE]
#assign data frames to global environment
assign(name.data.0, data.0, envir = .GlobalEnv)
assign(name.data.1, data.1, envir = .GlobalEnv)
assign(name.data.5, data.5, envir = .GlobalEnv)
assign(name.data.10, data.10, envir = .GlobalEnv)
}
}
#function call
examplefunction(names)
Let's take a look at the one-column dataframes:
str(data1.1)
'data.frame': 30 obs. of 1 variable:
$ X1: num 1 1 1 1 1 1 1 1 1 1 ...
str(data2.10)
'data.frame': 30 obs. of 1 variable:
$ X6: num 10 10 10 10 10 10 10 10 10 10 ...
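The effect of drop = FALSE can also be seen in isolation on a toy two-column data frame (df here is made up for illustration):

```r
df <- data.frame(a = 1:3, b = 4:6)   # toy data frame with two columns

v  <- df[, -1]                 # one column left: simplified to a plain vector
d1 <- df[, -1, drop = FALSE]   # stays a one-column data frame named "b"
d2 <- df[-1]                   # list-style indexing also keeps the data frame

is.data.frame(v)
is.data.frame(d1)
```

The simplification only happens with matrix-style indexing (df[, j]); the list-style form df[j] never drops to a vector.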
Now, all that said, I agree with Roland's comment -- you almost never want to take this approach of assigning to the global environment in a complicated way, and instead should return a list; that's best practice. However, you'd still need drop = FALSE to keep the column names.
Really, there is probably an entirely different and much better approach to whatever data wrangling you want to do. I just don't have a good enough grasp of your task to suggest one.
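As a rough illustration of the "return a list" idea, here is a sketch that groups columns by their first-row value and returns one data frame per case. The helper name split_by_case and the toy data are assumptions, and unlike the original function, cases that do not occur are simply absent from the result rather than returned as 0-column data frames:

```r
# Toy data in the shape described in the question: the first row encodes the case
data1 <- data.frame(X1 = rep(1, 30), X2 = rep(5, 30), X3 = rep(5, 30),
                    X4 = rep(10, 30), X5 = rep(10, 30))

# Hypothetical helper: group column positions by their first-row value and
# return one data frame per case value instead of assigning globals
split_by_case <- function(data) {
  groups <- split(seq_along(data), unlist(data[1, ]))
  lapply(groups, function(idx) data[idx])   # data[idx] always stays a data frame
}

res <- split_by_case(data1)
names(res)
str(res[["1"]])
```

Applied to several data sets, lapply(list(data1 = data1), split_by_case) would give a nested list, so nothing has to be written to the global environment.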
I have 8 datasets and I want to apply a function that converts any number less than 5 to NA in 3 columns (var1, var2, var3) of each dataset. How can I write a function that does this efficiently? I went through lots of similar questions on Stack Overflow but didn't find an answer where specific columns were used. I have written the function to do the replacement but can't figure out how to apply it to all the datasets.
Input:
Data1
variable1 variable2 variable3 variable4
10 36 56 99
15 3 2 56
4 24 1 1
Expected output:
variable1 variable2 variable3 variable4
10 36 56 99
15 NA NA 56
NA 24 NA 1
Perform the same thing for 7 more datasets.
So far I have stored the needed variables and datasets in two different lists.
var1=enquo(variable1)
var2=enquo(variable2)
var3=enquo(variable3)
Total=3
listofdfs=list()
listofdfs_1=list()
for(i in 1:8) {
df=sym((paste0("Data",i)))
listofdfs[[i]]=df
}
for(e in 1:Total) {
listofdfs[[e]]= eval(sym(paste0("var",e)))
}
The selected columns will go through this function:
temp_1=function(x,h) {
h=enquo(h)
for(e in 1:Total) {
if(substr(eval(sym(paste0("var",e))),1,3)=="var") {
y= x %>% mutate_at(vars(!!h), ~ replace(., which(.<=5),NA))
return(y)
}
}
}
I was expecting something like:
lapply(for each dataset's selected columns,temp_1)
Here's a simple approach that should work:
cols_to_edit = paste0("var", 1:3)
result_list = lapply(list_of_dfs, function(x) {
x[cols_to_edit][x[cols_to_edit] < 5] = NA
return(x)
})
I assume your starting data is in a list called list_of_dfs, that the names of columns to edit are the same in all data frames, and that you can construct a character vector cols_to_edit with those names.
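A small worked example of this approach, using the Data1 values from the question plus a made-up Data2. The column names follow the question's sample table (variable1 to variable3), and the threshold is "less than 5" as in the question text:

```r
# Data1 is taken from the question; Data2 is a made-up second data frame
list_of_dfs <- list(
  Data1 = data.frame(variable1 = c(10, 15, 4), variable2 = c(36, 3, 24),
                     variable3 = c(56, 2, 1),  variable4 = c(99, 56, 1)),
  Data2 = data.frame(variable1 = c(1, 8, 9),   variable2 = c(7, 7, 2),
                     variable3 = c(5, 6, 3),   variable4 = c(0, 1, 2))
)
cols_to_edit <- paste0("variable", 1:3)   # column names as shown in the question

result_list <- lapply(list_of_dfs, function(x) {
  x[cols_to_edit][x[cols_to_edit] < 5] <- NA   # only the chosen columns are touched
  x
})

result_list$Data1
```

Values in variable4 stay as they are because that column is not in cols_to_edit, and a value of exactly 5 is kept because the comparison is strict.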
Here is a solution to the problem in the question.
First of all, create a test data set.
createData <- function(Total = 3){
numcols <- Total + 1
set.seed(1234)
for(i in 1:8){
tmp <- replicate(numcols, sample(10, 20, TRUE))
tmp <- as.data.frame(tmp)
names(tmp) <- paste0("var", seq_len(numcols))
assign(paste0("Data", i), tmp, envir = .GlobalEnv)
}
}
createData()
Now, the data transformation.
This is much easier if the many dataframes are in a "list".
df_list <- mget(ls(pattern = "^Data"))
I will present two solutions, a base R one and a tidyverse one. Note that both solutions use the function temp_1, which is written in base R only.
library(tidyverse)
temp_1 <- function(x, h){
f <- function(v){
is.na(v) <- v <= 5
v
}
x[h] <- lapply(x[h], f)
x
}
h <- grep("var[123]", names(df_list[[1]]), value = TRUE)
df_list1 <- lapply(df_list, temp_1, h)
df_list2 <- df_list %>% map(temp_1, h)
identical(df_list1, df_list2)
#[1] TRUE
I have a set of datasets that end with .fin. I would like to create a list and merge them using
ls(pattern = ".fin")
"A.fin" "B.fin" "C.fin" "D.fin" "E.fin" "F.fin" "G.fin" "H.fin" "I.fin"
"J.fin" "K.fin" "L.fin" "M.fin" "N.fin"
I would like to go from the code above to the line below beginning with list, something like list(ls(pattern = ".fin")); however, this only returns a list containing a vector of the data set names. I have also tried list(get(ls(pattern = ".fin"))) and list(eval(parse(text = ls(pattern = ".fin")))), to no avail.
list(ls(pattern = ".fin")) ### <- REPLACE THIS SOMEHOW %>%
Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2,by="i"), .)
You can use mget:
mget(ls(pattern = ".fin"))
A.fin <- c(1,2,3)
B.fin <- c(4,5,6)
mget(ls(pattern = ".fin"))
#$A.fin
#[1] 1 2 3
#$B.fin
#[1] 4 5 6
get is not vectorized so you should "loop" over whatever ls() is returning. You can do that either
sapply(ls(pattern = ".fin"), FUN = get)
or the long way
xy <- ls(pattern = ".fin")
mylist <- vector("list", length(xy))
for (i in 1:length(mylist)) {
mylist[[i]] <- get(xy[i])
}
or use mget(ls(pattern = ".fin")).
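One detail about the pattern: ".fin" is a regular expression in which the dot matches any character, so it can pick up more names than intended. A sketch of a stricter pattern together with the merge step, using made-up objects in a separate environment and base R's merge(..., all = TRUE) as a stand-in for the full_join() in the question:

```r
# A separate environment keeps the sketch self-contained (objects are made up)
e <- new.env()
assign("A.fin", data.frame(i = 1:3, a = c(10, 20, 30)), envir = e)
assign("B.fin", data.frame(i = 2:4, b = c(1, 2, 3)), envir = e)
assign("coffin", "not a dataset", envir = e)

loose  <- ls(e, pattern = ".fin")       # "." matches any character, so "coffin" is caught too
strict <- ls(e, pattern = "\\.fin$")    # literal dot, anchored at the end of the name

fin_list <- mget(strict, envir = e)     # named list of the matching objects

# Fold the list into one data frame, keeping all keys from every piece
merged <- Reduce(function(d1, d2) merge(d1, d2, by = "i", all = TRUE), fin_list)
```

In the global environment the same idea would read mget(ls(pattern = "\\.fin$")), which can then be piped into the Reduce() call from the question.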
I want to run split() in a for loop, but when I pass it variable text, it just creates a new data.frame containing the text. The idea here is to split CMPD_DF_1, CMPD_DF_2, etc. based on CMPD_DF_1[5], CMPD_DF_2[5], etc. How do I pass in the data.frame and not a string?
for (i in 1:10) {
split(paste("CMPD_DF", i, sep = "_"),
paste(paste("CMPD_DF", i, sep = "_"), "[5]", sep=""))
}
Sorry for the initial confusion. You can put your data frames in a list and then use lapply. This assumes the column you are splitting on is the same in each data frame. I'll update with a more general solution...
d1 <- data.frame(x =1:10, y = rep(letters[1:2], each = 5))
d2 <- d1
l <- list(d1,d2)
myFun <- function(x){
return(split(x,x[,2]))
}
lapply(l,myFun)
And here's a way to do this using mapply that will allow for different splitting columns in each data frame. You just pre-specify the columns in a separate list and pass them to mapply:
l <- list(d1,d2)
splitColumns <- list("y","y")
myFun2 <- function(x,col){
return(split(x,x[,col]))
}
mapply(myFun2,l,splitColumns,SIMPLIFY = FALSE)
Your code doesn't work because you're not passing a data.frame to split. You're passing a character vector that contains a string with the name of your data.frame. Something like this should work, but it's not very R-like. @joran's answer is preferable.
for (i in 1:10) {
dfname <- paste("CMPD_DF", i, sep = "_")
split(get(dfname), get(dfname)[5])
}
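The loop above calls split() but does not store the result anywhere. A sketch that keeps the pieces in a named list, assuming two toy data frames whose fifth column is the grouping variable (the data and the name split_list are made up):

```r
# Toy stand-ins for CMPD_DF_1, CMPD_DF_2, ...; column 5 holds the groups
CMPD_DF_1 <- data.frame(a = 1:4, b = 5:8, c = 9:12, d = 13:16,
                        grp = c("x", "x", "y", "y"))
CMPD_DF_2 <- data.frame(a = 1:4, b = 5:8, c = 9:12, d = 13:16,
                        grp = c("p", "q", "q", "q"))

df_names <- paste("CMPD_DF", 1:2, sep = "_")

# mget() fetches the data frames by name; each one is split on its 5th column
split_list <- lapply(mget(df_names), function(d) split(d, d[[5]]))

names(split_list)
split_list[["CMPD_DF_2"]][["q"]]
```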