I had a bit specific problem in running for loops in colnames , increment i by 10 and creating new dataframe using i.
For example
x <- data.frame(A = c(1, 2), B = c(3, 4),C =c(5,6),D=c(7,8),E=c(9,10),F=c(11,12),G=c(13,14),
H=c(16,17),I=c(18,19),J=c(22,25),K=c(12,13),L=c(19,20))
# below create 12 dataframe starting from A to L which i do not want
for (i in colnames(x))
assign(i, subset(x, select=i))
I want to increment i by 3, so I want my output as col A to C in one dataframe, col D to F in one dataframe, col G to I in one dataframe and col J to L in one dataframe, which means only 4 dataframes not 12.
Assigning to the global environment is generally not the way to go, especially from functions. You could do the following, generating a list containg the splitted dataframes.
Make a vector of indices where a 'new' dataframe should start, starting at 1 and incrementing by i.
i<- 3
start_indices <- seq(1,ncol(x),by=i)
> start_indices
[1] 1 4 7 10
Use lapply to generate a list of splitted dataframes.
res <- lapply(start_indices, function(j){
return(x[,j:(j+i-1)])
})
>res
[[1]]
A B C
1 1 3 5
2 2 4 6
[[2]]
D E F
1 7 9 11
2 8 10 12
[[3]]
G H I
1 13 16 18
2 14 17 19
[[4]]
J K L
1 22 12 19
2 25 13 20
If you want to use your approach
> for (i in 1:(ncol(x)/3))
+ assign(names(x)[3*i-2], subset(x, select=(3*i-2):(3*i)))
> A
A B C
1 1 3 5
2 2 4 6
> D
D E F
1 7 9 11
2 8 10 12
> G
G H I
1 13 16 18
2 14 17 19
> J
J K L
1 22 12 19
2 25 13 20
Just thought to add last line on unlisting list ,previous answer by Heroka
create multiple data frame from list of
for(i in 1:length(res)) {
assign(paste0("gf", i), res[[i]])
}
Related
I have a data and a vector contain name of variables and i want to create new variable contain rowsum of variables in my vector, and i want the name of new variable ( sum of variables in my vector) to be concatenation of names of variables
for example i have this data
> data
Name A B C D E
r1 1 5 12 21 15
r2 2 4 7 10 9
r3 5 15 6 9 6
r4 7 8 0 7 18
and this vector
>Vec
"A" , "C" , "D"
the result i want is the sum of Variables A , C and D and the name of my variable is ACD
here's the result i want :
> data
Name A B C D ACD E
r1 1 5 12 21 34 15
r2 2 4 7 10 18 9
r3 5 15 6 9 20 6
r4 7 8 0 7 14 18
I tried this :
data <- cbind(data , as.data.frame(rowSums(data[,Vec]) ))
But i don't know how to create the name
Here's the result i got
>data
Name A B C D E rowSums(data[,Vec])
r1 1 5 12 21 15 34
r2 2 4 7 10 9 18
r3 5 15 6 9 6 20
r4 7 8 0 7 18 14
Not that i gave just a sample example to explain what i want to do
i want to do affectation of my old data to my new data ( that contains the new variable), like i did in my command above
edit 1 : in my real program , i don't know the elements ( name of my variables in my vector so i can not do data$ACD <- cbind(data , as.data.frame(rowSums(data[,Vec]) )) as suggested by Pax, in fact i have for loop that generate my vectors and each time i create variable to put the result i want ( sum of variable in my vector) so i don't know how to affect the name without knowing the elements of vectors
Please tell me if you need anymore clarifications or informations
Thank you
It's not a one line solution but you can set the name on the subsequent line:
data <- data.frame(A = c(1, 2, 5, 7),
B = c(5, 4, 15, 8),
C = c(12, 7, 6, 0),
D = c(21, 10, 9, 7),
E = c(15, 9, 6, 18))
Vec <- c("A" , "C" , "D")
data <- cbind(data, rowSums(data[,Vec]))
# Add name
names(data)[ncol(data)] <- paste(Vec, collapse="")
# A B C D E ACD
# 1 1 5 12 21 15 34
# 2 2 4 7 10 9 19
# 3 5 15 6 9 6 20
# 4 7 8 0 7 18 14
Here is an option with the janitor package. You can use adorn_totals which appends a totals row or column to a data.frame. The name argument includes the name of the new column in this case, and final Vec included at the end includes the columns to total.
library(janitor)
adorn_totals(data, "col", fill = NA, na.rm = TRUE, name = paste(Vec, collapse = ""), all_of(Vec))
Output
A B C D E ACD
1 5 12 21 15 34
2 4 7 10 9 19
5 15 6 9 6 20
7 8 0 7 18 14
I have data to analyse that is presented in the form of a list (just one row and MANY columns).
A B C D E F G H I
1 2 3 4 5 6 7 8 9
Is there a way to tell R to split this list every x items and get something as seen below (the columns C D E F G H I are virtually the same as A B)?
A B
1 2
3 4
5 6
7 8
9
If the number of columns is a multiple of 'x', then we unlist the dataset, and use matrix to create the expected output.
as.data.frame(matrix(unlist(df1), ncol=2, dimnames=list(NULL, c("A", "B")) , byrow=TRUE))
If the number of columns is not a multiple of 'x', then
x <- 2
gr <- as.numeric(gl(ncol(df1), x, ncol(df1)))
lst <- split(unlist(df1), gr)
do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
# A B
# 1 1 2
# 2 3 4
# 3 5 6
# 4 7 8
# 5 9 NA
I have a large data frame that I would like to convert in to smaller subset data frames using a for loop. I want the new data frames to be based on the the values in a column in the large/parent data frame. Here is an example
x<- 1:20
y <- c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","C","C","C")
df <- as.data.frame(cbind(x,y))
ok, now I want three data frames, one will be columns x and y but only where y == "A", the second where y==
"B" etc etc. So the end result will be 3 new data frames df.A, df.B, and df.C. I realize that this would be easy to do out of a for loop but my actual data has a lot of levels of y so using a for loop (or similar) would be nice.
Thanks!
If you want to create separate objects in a loop, you can use assign. I used unique because you said you had many levels.
for(i in unique(df$y)) {
nam <- paste("df", i, sep = ".")
assign(nam, df[df$y==i,])
}
> df.A
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 8 A
> df.B
x y
9 9 B
10 10 B
11 11 B
12 12 B
13 13 B
14 14 B
I think you just need the split function:
split(df, df$y)
$A
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 8 A
$B
x y
9 9 B
10 10 B
11 11 B
12 12 B
13 13 B
14 14 B
15 15 B
16 16 B
17 17 B
$C
x y
18 18 C
19 19 C
20 20 C
It is just a matter of properly subsetting the output to split and store the results to objects like dfA <- split(df, df$y)[[1]] and dfB <- split(df, df$y)[[2]] and so on.
I have a large table with 50000 obs. The following mimic the structure:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B",NA,"D","E",NA,"G","H","I")
b <- c(11,2233,12,2,22,13,23,23,100)
c <- c(12,10,12,23,16,17,7,9,7)
df <- data.frame(ID ,a,b,c)
Where there are some missing values on the vector "a". However, I have some tables where the ID and the missing strings are included:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B","C","D","E","F","G","H","I")
key <- data.frame(ID,a)
Is there a way to include the missing strings from key into the column a using the ID?
Another options is to use data.tables fast binary join and update by reference capabilities
library(data.table)
setkey(setDT(df), ID)[key, a := i.a]
df
# ID a b c
# 1: 1 A 11 12
# 2: 2 B 2233 10
# 3: 3 C 12 12
# 4: 4 D 2 23
# 5: 5 E 22 16
# 6: 6 F 13 17
# 7: 7 G 23 7
# 8: 8 H 23 9
# 9: 9 I 100 7
If you want to replace only the NAs (not all the joined cases), a bit more complicated implemintation will be
setkey(setDT(key), ID)
setkey(setDT(df), ID)[is.na(a), a := key[.SD, a]]
You can just use match; however, I would recommend that both your datasets are using characters instead of factors to prevent headaches later on.
key$a <- as.character(key$a)
df$a <- as.character(df$a)
df$a[is.na(df$a)] <- key$a[match(df$ID[is.na(df$a)], key$ID)]
df
# ID a b c
# 1 1 A 11 12
# 2 2 B 2233 10
# 3 3 C 12 12
# 4 4 D 2 23
# 5 5 E 22 16
# 6 6 F 13 17
# 7 7 G 23 7
# 8 8 H 23 9
# 9 9 I 100 7
Of course, you could always stick with factors and factor the entire "ID" column and use the labels to replace the values in column "a"....
factor(df$ID, levels = key$ID, labels = key$a)
## [1] A B C D E F G H I
## Levels: A B C D E F G H I
Assign that to df$a and you're done....
Named vectors make nice lookup tables:
lookup <- a
names(lookup) <- as.character(ID)
lookup is now a named vector, you can access each value by lookup[ID] e.g. lookup["2"] (make sure the number is a character, not numeric)
## should give you a vector of a as required.
lookup[as.character(ID_from_big_table)]
I have a dataframe that I want to drop those columns with NA's rate > 70% or there is dominant value taking over 99% of rows. How can I do that in R?
I find it easier to select rows with logic vector in subset function, but how can I do the similar for columns? For example, if I write:
isNARateLt70 <- function(column) {//some code}
apply(dataframe, 2, isNARateLt70)
Then how can I continue to use this vector to subset dataframe?
If you have a data.frame like
dd <- data.frame(matrix(rpois(7*4,10),ncol=7, dimnames=list(NULL,letters[1:7])))
# a b c d e f g
# 1 11 2 5 9 7 6 10
# 2 10 5 11 13 11 11 8
# 3 14 8 6 16 9 11 9
# 4 11 8 12 8 11 6 10
You can subset with a logical vector using one of
mycols<-c(T,F,F,T,F,F,T)
dd[mycols]
dd[, mycols]
There's really no need to write a function when we have colMeans (thanks #MrFlick for the advice to change from colSums()/nrow(), and shown at the bottom of this answer).
Here's how I would approach your function if you want to use sapply on it later.
> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
z = c(rep(NA, 3), 1, 2))
> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
# x y z
# FALSE TRUE TRUE
Then, to subset with the above line your data using the above line of code, it's
> d[sapply(d, isNARateLt70)]
But as mentioned, colMeans works just the same,
> d[colMeans(is.na(d)) <= 0.7]
# y z
# 1 1 NA
# 2 NA NA
# 3 NA NA
# 4 1 1
# 5 1 2
Maybe this will help too. The 2 parameter in apply() means apply this function column wise on the data.frame cars.
> columns <- apply(cars, 2, function(x) {mean(x) > 10})
> columns
speed dist
TRUE TRUE
> cars[1:10, columns]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17