I have a dataset that has 148 columns. I need to combine them in groups of 4.
For example, V1, V2, V3, V4 should be combined into X1:
V1 <- c(0, 3)
V2 <- c("F", "F")
V3 <- c(2, 4)
V4 <- c("A", "C")

X1
0F2A
3F4C
I know I can use
```{r}
new_data_4 <- new %>%
  unite(V1:V4) %>%
  unite(V5:V8) %>%
  unite(V9:V12) %>%
  unite(V13:V16)
```
with great success, but I would like to turn this into a function. My wish is that it can count the number of columns and do the uniting automatically, without hardcoding the column numbers, since I have more files to go over. I have looked around Stack Overflow and found lots of examples for specific problems that don't really match what I have.
I have tried:
```{r}
unite_columns <- function(x){
  united_cols <- tidyr::unite(x, seq_along(1, ncol(x), 4), seq_along(4, ncol(x), 4))
  return(united_cols)
}
```
and
```{r}
unite_columns <- function(x){
  united_cols <- unite(x, seq(1, ncol(x), 4), seq(4, ncol(x), 4))
  return(united_cols)
}
```
I was thinking I could use a tactic similar to the one used to merge strings, but it did not work.
Any help would be greatly appreciated. TIA
You can use split.default to split the columns into groups of 4 and paste the values rowwise using do.call.
result <- data.frame(sapply(split.default(new, ceiling(seq_along(new)/4)),
                            function(x) do.call(paste0, x)))
# X1 X2
#1 0F2A 8A4R
#2 3F4C 9B5K
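Since the question asks for something that works on any number of columns, here is a minimal sketch that wraps this approach in a function (assuming the column count is a multiple of the group size; the name unite_every is made up):
unite_every <- function(df, group_size = 4) {
  # split the columns into consecutive groups and paste each group rowwise
  groups <- split.default(df, ceiling(seq_along(df) / group_size))
  data.frame(lapply(groups, function(x) do.call(paste0, x)))
}

unite_every(new)     # same result as above
unite_every(new, 2)  # groups of 2 instead of 4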
Data
new <- data.frame(V1 = c(0, 3), V2 = c("F", "F"), V3 = c(2, 4), V4 = c("A", "C"),
                  V5 = c(8, 9), V6 = c("A", "B"), V7 = c(4, 5), V8 = c("R", "K"))
new
# V1 V2 V3 V4 V5 V6 V7 V8
#1 0 F 2 A 8 A 4 R
#2 3 F 4 C 9 B 5 K
Using Reduce.
sapply(1:(ncol(new)/4), function(f) Reduce(paste0, new[4*(f-1) + 1:4]))
#      [,1]   [,2]   [,3]
# [1,] "0F2A" "8A4R" "1C9D"
# [2,] "3F4C" "9B5K" "2D8S"
If you want a data frame:
as.data.frame(sapply(1:(ncol(new)/4), function(f) Reduce(paste0, new[4*(f-1) + 1:4])))
#     V1   V2   V3
# 1 0F2A 8A4R 1C9D
# 2 3F4C 9B5K 2D8S
Data
new <- structure(list(V1 = c(0, 3), V2 = c("F", "F"), V3 = c(2, 4),
                      V4 = c("A", "C"), V5 = c(8, 9), V6 = c("A", "B"),
                      V7 = c(4, 5), V8 = c("R", "K"), V9 = c(1, 2),
                      V10 = c("C", "D"), V11 = c(9, 8), V12 = c("D", "S")),
                 class = "data.frame", row.names = c(NA, -2L))
Another purrr and tidyr option could be:
library(purrr)
library(tidyr)
imap_dfc(.x = split.default(new, ceiling(1:ncol(new)/4)),
         ~ .x %>%
             unite(col = !!paste0("X", .y), everything(), sep = ""))
X1 X2
1 0F2A 8A4R
2 3F4C 9B5K
You can also do this with the tidyverse only, using purrr::map2_dfc.
The first argument of the map is a sequence of length n/4 (needless to say, you may use ncol(new) instead of storing n separately), and the second argument is the vector of names for the columns to be generated.
At each iteration, the map function pulls out four columns according to the integer-division expression, unites them under the name given by the second argument, and then selects that new column only.
The columns produced by the individual iterations are column-bound, which is why map2_dfc is used.
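To see how the integer-division expression assigns columns to groups, here is a quick illustration (an aside, using 8 columns):
(3 + 1:8) %/% 4
# [1] 1 1 1 1 2 2 2 2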
library(tidyverse)
#say n is 148
n <- 148
map2_dfc(seq_len(n/4), paste0("X", seq_len(n/4)), ~new %>%
unite(!!.y,
seq_along(new)[(3 + seq_along(new)) %/% 4 == .x],
sep = "") %>% select(all_of(.y))
)
Check it with the data generated by @Ronak:
n <- 8
map2_dfc(seq_len(n/4), paste0("X", seq_len(n/4)), ~new %>%
unite(!!.y,
seq_along(new)[(3 + seq_along(new)) %/% 4 == .x],
sep = "") %>% select(all_of(.y))
)
X1 X2
1 0F2A 8A4R
2 3F4C 9B5K
Or with the data generated by @jay.sf:
n <- 12
map2_dfc(seq_len(n/4), paste0("X", seq_len(n/4)), ~new %>%
unite(!!.y,
seq_along(new)[(3 + seq_along(new)) %/% 4 == .x],
sep = "") %>% select(all_of(.y))
)
X1 X2 X3
1 0F2A 8A4R 1C9D
2 3F4C 9B5K 2D8S
Once again, I'm facing a problem that I can't translate to SparkR.
I have a SparkDataFrame in which some columns contain only NAs, and I want to delete all of these columns.
I discovered SparkR recently and I'm far from understanding everything about how it works, but it's very frustrating to get stuck on a point that isn't all that complicated...
Here is a reprex and the way I do it in plain R (with data.table):
library(data.table)
df <- data.frame(V1 = base::sample(1:10, 5),
                 V2 = base::rep(NA, 5),
                 V3 = base::sample(1:10, 5),
                 V4 = base::rep(NA, 5),
                 V5 = base::rep(NA, 5),
                 X = runif(n = 5, min = 0, max = 5))
sdf <- createDataFrame(df)
# data.table way of dropping the all-NA columns
dt <- setDT(df)
na.lst <- sapply(dt, function(x) all(is.na(x)))
dt[, which(na.lst) := NULL]
Thanks!
You can consider the following approach:
library(SparkR)

df <- data.frame(V1 = base::sample(1:10, 5),
                 V2 = base::rep(NA, 5),
                 V3 = base::sample(1:10, 5),
                 V4 = base::rep(NA, 5),
                 V5 = base::rep(NA, 5),
                 X = runif(n = 5, min = 0, max = 5))
sdf <- createDataFrame(df)

col_Names <- colnames(sdf)
nb_Col_Names <- length(col_Names)
vec_Bool <- rep(FALSE, nb_Col_Names)

# a column is kept if dropping its all-NA rows leaves at least one row
for(i in 1:nb_Col_Names)
{
  dim_Temp <- dim(dropna(select(sdf, col = col_Names[i]), how = "all"))
  if(dim_Temp[1] != 0) vec_Bool[i] <- TRUE
}

col <- col_Names[vec_Bool]
newdf <- select(sdf, col = col)
as.data.frame(newdf)
V1 V3 X
1 6 1 2.286716
2 10 3 3.532843
3 2 9 2.030851
4 8 6 3.304420
5 4 10 1.596272
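If you need to do this for several SparkDataFrames, here is a minimal sketch that wraps the same loop into a helper, using only the SparkR calls already shown above (the function name drop_all_na_cols is made up):
drop_all_na_cols <- function(sdf) {
  col_Names <- colnames(sdf)
  keep <- vapply(col_Names, function(cn) {
    # keep the column if dropping its NA rows leaves at least one row
    nrow(dropna(select(sdf, col = cn), how = "all")) > 0
  }, logical(1))
  select(sdf, col = col_Names[keep])
}

newdf <- drop_all_na_cols(sdf)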
See Remove columns with only NA values with SparkR
I am working with the R programming language.
In the following link (https://www.geeksforgeeks.org/how-to-find-the-percentage-of-missing-values-in-a-dataframe-in-r/), I found a method to calculate the total percentage of NAs in a data frame:
# declaring a data frame in R
data_frame = data.frame(C1 = c(1, 2, NA, 0),
                        C2 = c(NA, NA, 3, 8),
                        C3 = c("A", "V", "j", "y"),
                        C4 = c(NA, NA, NA, NA))

percentage = mean(is.na(data_frame)) * 100
# [1] 43.75
My Question: Is there a way to extend this to count the percentage of "any element" in the data frame?
For example, can this be used to calculate the percentage of 0s in the data set? Or the percentage of times "j" appears in the data? Or the percentage of times "2" appears in the data set?
I can do this manually:
# count percentage of "j" in the data
v1 = nrow(subset(data_frame, C1 == "j"))
v2 = nrow(subset(data_frame, C2 == "j"))
v3 = nrow(subset(data_frame, C3 == "j"))
v4 = nrow(subset(data_frame, C4 == "j"))

percentage = ((v1 + v2 + v3 + v4) / (nrow(data_frame) * ncol(data_frame))) * 100
# [1] 6.25
# count percentage of "0" in the data (I don't think this is right, it should be written as "nrow(subset(data_frame, C1 <= 0))"?)
v1 = nrow(subset(data_frame, C1 = 0))
v2 = nrow(subset(data_frame, C2 = 0))
v3 = nrow(subset(data_frame, C3= 0))
v4 = nrow(subset(data_frame, C4 = 0))
percentage = ((v1 + v2 + v3 + v4) / ((nrow(data_frame) * ncol(data_frame)))) * 100
But is there a faster way to do this?
Thanks!
You can try to unlist your data frame into a vector:
vec = unlist(data_frame)
mean(vec %in% "j") * 100 # 6.25
mean(vec %in% "0") * 100 # 6.25
mean(vec %in% NA) * 100 # 43.75
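Note that unlist() coerces the mixed columns to a character vector here, which is why matching numeric 0 against "0" works; a quick check (an aside):
class(unlist(data_frame))
# [1] "character"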
Assuming there are no embedded lists in the cells of the data frame, you don't have to unlist it:
data_frame = data.frame(C1 = c(1, 2, NA, 0),
                        C2 = c(NA, NA, 3, 8),
                        C3 = c("A", "V", "j", "y"),
                        C4 = c(NA, NA, NA, NA))
sum(data_frame == 'j', na.rm = TRUE) / prod(dim(data_frame)) * 100
[1] 6.25
sum(data_frame == 0, na.rm = TRUE) / prod(dim(data_frame)) * 100
[1] 6.25
sum(is.na(data_frame)) / prod(dim(data_frame)) * 100
[1] 43.75
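If you want this as a small reusable helper for any value, here is a sketch (the name pct_of is made up; NA is treated as a special case because == returns NA for missing values):
pct_of <- function(df, val) {
  n_cells <- prod(dim(df))
  if (is.na(val)) {
    sum(is.na(df)) / n_cells * 100
  } else {
    sum(df == val, na.rm = TRUE) / n_cells * 100
  }
}

pct_of(data_frame, "j")  # 6.25
pct_of(data_frame, 0)    # 6.25
pct_of(data_frame, NA)   # 43.75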
Here is a tidyverse + base R solution.
library(tidyverse)
data_frame %>%
mutate(across(everything(), ~ .x %in% "j")) %>%
unlist() %>%
mean() * 100
Output
[1] 6.25
Though this could easily be turned into a function.
calc <- function(df, val) {
df %>%
mutate(across(everything(), ~ .x %in% val)) %>%
unlist() %>%
mean() * 100
}
Output
calc(data_frame, "j") # 6.25
calc(data_frame, "0") # 6.25
calc(data_frame, NA) # 43.75
I made a list out of my dataframe, based on the factor levels in column A. In the list I would like to remove that column. My head is saying lapply, but not anything else :P
$A
ID Test
A 1
A 1
$B
ID Test
B 1
B 3
B 5
Into this
$A
Test
1
1
$B
Test
1
3
5
Assuming your list is called myList, something like this should work:
lapply(myList, function(x) { x["ID"] <- NULL; x })
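Applied to the list shown in the question (reconstructed here under the assumption that it was created with split()):
df <- data.frame(ID = c("A", "A", "B", "B", "B"),
                 Test = c(1, 1, 1, 3, 5))
myList <- split(df, df$ID)

lapply(myList, function(x) { x["ID"] <- NULL; x })
# $A
#   Test
# 1    1
# 2    1
#
# $B
#   Test
# 3    1
# 4    3
# 5    5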
Update
For a more general solution, you can also use something like this:
# Sample data
myList <- list(A = data.frame(ID = c("A", "A"),
Test = c(1, 1),
Value = 1:2),
B = data.frame(ID = c("B", "B", "B"),
Test = c(1, 3, 5),
Value = 1:3))
# Keep just the "ID" and "Value" columns
lapply(myList, function(x) x[(names(x) %in% c("ID", "Value"))])
# Drop the "ID" and "Value" columns
lapply(myList, function(x) x[!(names(x) %in% c("ID", "Value"))])
If you are a tidyverse user, there is an alternative solution which utilizes the map function from the purrr package.
# Create same sample data as above
myList <- list(A = data.frame(ID = c("A", "A"),
Test = c(1, 1),
Value = 1:2),
B = data.frame(ID = c("B", "B", "B"),
Test = c(1, 3, 5),
Value = 1:3))
# Remove the column by name in each element of the list
library(purrr)
library(dplyr)
map(myList, ~ .x %>% select(-ID))
We can efficiently use the bracket function "[" here.
Example
L <- replicate(3, iris[1:3, 1:4], simplify=FALSE) # example list
Delete columns by numbers
lapply(L, "[", -c(2, 3))
Delete columns by names
lapply(L, "[", -grep(c("Sepal.Width|Petal.Length"), names(L[[1]])))
Result
# [[1]]
# Sepal.Length Petal.Width
# 1 5.1 0.2
# 2 4.9 0.2
# 3 4.7 0.2
#
# [[2]]
# Sepal.Length Petal.Width
# 1 5.1 0.2
# 2 4.9 0.2
# 3 4.7 0.2
If some of the data frames in the list don't contain the ID column, you can use map_if to remove it only where it exists.
myList <- list(A = data.frame(ID = c("A", "A"),
Test = c(1, 1),
Value = 1:2),
B = data.frame(ID = c("B", "B", "B"),
Test = c(1, 3, 5),
Value = 1:3),
C = data.frame(Test = c(1, 3, 5),
Value = 1:3))
map_if(myList, ~ "ID" %in% names(.x), ~ .x %>% select(-ID), .depth = 2)
This is my code so far:
record <- function(input, string){
  filter(input, input$race == string |
           input$flag == string)
}
Please help
You could try which. Using the data from @RuiBarradas:
set.seed(1234)
recordings <- data.frame(V1 = sample(LETTERS, 10),
                         V2 = sample(LETTERS, 10),
                         V3 = letters[1:10], stringsAsFactors = FALSE)
records <- function(recordings, string){
  hits <- which(recordings == string, arr.ind = TRUE)  # row/column indices of every match
  rws <- hits[, 1]
  cls <- hits[, 2]
  recordings <- recordings[rws, -cls, drop = FALSE]    # matching rows, minus the matching columns
  return(recordings)
}
For A, it would return:
records(recordings, "A")
V2 V3
7 F g
For X:
records(recordings, "X")
V3
4 d
5 e
This assumes that no value is present in all columns.
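If a value could appear in every column, the negative column index would drop every column; a small variant guarding against that (a sketch, assuming you then want to keep all columns):
records <- function(recordings, string){
  hits <- which(recordings == string, arr.ind = TRUE)
  rws <- hits[, 1]
  cls <- unique(hits[, 2])
  # if the value occurs in every column, keep them all instead of dropping everything
  keep <- if (length(cls) == ncol(recordings)) seq_along(recordings) else -cls
  recordings[rws, keep, drop = FALSE]
}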
If you need to only know the corresponding row values:
records <- function(recordings, string){
return(which(recordings == string, arr.ind = TRUE)[,1])
}
records(recordings, "X")
[1] 4 5
See if the following is what you want.
First I will make up a dataset, since you have not posted one.
set.seed(1234) # Make the results reproducible
recordings <- data.frame(V1 = sample(LETTERS, 10),
                         V2 = sample(letters, 10),
                         V3 = sample(4, 10, TRUE))
Now the function.
records <- function(DF, string){
  inx <- DF == string                                    # logical matrix: TRUE where a cell matches
  i <- apply(inx, 1, function(x) Reduce('||', x))        # rows containing at least one match
  DF[i, which(colSums(!inx) == nrow(DF)), drop = FALSE]  # keep only the columns with no match at all
}
records(recordings, "A")
# V2 V3
#7 f 3
records(recordings, "x")
# V1 V3
#5 S 1
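As an aside, Reduce('||', x) over a non-empty logical vector is equivalent to any(x), which some may find more readable:
x <- c(FALSE, TRUE, FALSE)
Reduce('||', x)  # TRUE
any(x)           # TRUE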