I want to remove a column from a data frame only if it's there.
Example:
a <- 1:5
x <- tibble(a, b = a * 2, c = 1)
x %>% select(-'a')
x %>% select(-'d') # Throws an error
I want a way to remove columns a and d only if they exist, so a is removed and the attempt to remove d never happens. I tried modifying this solution to my problem, but I could not get it to work.
data.table
library(data.table)
a <- 1:5
x <- data.frame(a, b = a * 2, c = 1)
cols <- c("a", "d")
my_cols <- intersect(cols, names(x))
setDT(x)[, ..my_cols]
#> a
#> 1: 1
#> 2: 2
#> 3: 3
#> 4: 4
#> 5: 5
Created on 2021-07-09 by the reprex package (v2.0.0)
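Note that the snippet above keeps the columns from cols that exist in x, whereas the question asks to drop them. A minimal base R sketch of the dropping variant follows; the names cols_to_drop, keep and x_dropped are made up for illustration. With dplyr, the tidyselect helper any_of() should give the same result, as shown in the trailing comment.

```r
a <- 1:5
x <- data.frame(a, b = a * 2, c = 1)

# Hypothetical name: columns we would like gone, whether or not they exist
cols_to_drop <- c("a", "d")

# setdiff() ignores names that are not present, so "d" causes no error
keep <- setdiff(names(x), cols_to_drop)
x_dropped <- x[, keep, drop = FALSE]

names(x_dropped)

# dplyr equivalent (requires dplyr with tidyselect >= 1.0.0):
# x %>% select(-any_of(cols_to_drop))
```

drop = FALSE keeps the result a data frame even if only one column survives.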
I have a number of matrices that all have the same type of elements but different lengths. The columns are the same in all files (let's call them "A" and "B"), and the rows are mostly the same elements between files, but not always.
Here are some example data (in the form of data frames):
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
As you can see, even though "alpha", "beta" and "gamma" are always present, many of the other rows are not always there.
I would like to calculate 2 things:
the average values of all A and B columns in all matrices, ideally by creating an ave.matr that would have all rownames and the mean values of the columns "A" and "B"
A B
alpha 1 7
beta 2 6
delta 3 5
gamma 4 4
zeta 5 3
theta 6 2
epsilon 7 1
(where the above numbers are the mean values of all matrices)
and then an occurrence matrix, let's call it occur.matr, that would count the number of occurrences of each row across all matrices. It should look like this:
A B
alpha 3
beta 3
delta 2
gamma 3
zeta 2
theta 1
epsilon 1
I have started working on this today but I cannot figure out how to do it.
I started by creating a list and a matrix with the unique rownames from all matrices
list=c(rownames(df1),rownames(df2),rownames(df3))
unique=unique(list)
avematr<-matrix(NA,nrow=length(unique),ncol=2)
My next step would be to make the rownames of all matrices identical. I tried with match but I cannot figure it out, and at this moment I don't even know if this is the best strategy...
And all similar questions out there are related to merging the matrices (which is not what I want to do).
Any help is greatly appreciated
Here is a tidyverse approach:
library(tidyverse)
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
# move the row names into a column called "id" and stack the data frames
dat <- list(df1, df2, df3) %>%
  map_dfr(rownames_to_column, var = "id")
avg_dat <- dat %>%
  group_by(id) %>%
  summarise(A = mean(A),
            B = mean(B))
#> `summarise()` ungrouping output (override with `.groups` argument)
avg_dat
#> # A tibble: 7 x 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 alpha 1 5
#> 2 beta 2 4
#> 3 delta 3 4
#> 4 epsilon 7 1
#> 5 gamma 3.67 2.33
#> 6 theta 6 2
#> 7 zeta 5 2
occ_dat <- dat %>% count(id)
occ_dat
#> id n
#> 1 alpha 3
#> 2 beta 3
#> 3 delta 2
#> 4 epsilon 1
#> 5 gamma 3
#> 6 theta 1
#> 7 zeta 2
Created on 2021-01-27 by the reprex package (v0.3.0)
If you want to stick to base R:
For the averaging task it makes things easier when you add your rowname as a column. This prevents autonumbering of rownames when combining the dataframes. You then can simply loop over every unique rowname and construct the averages. A quick and dirty solution could look like this:
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
add_row_names_to_df <- function(df) {
  # keep the row names as a regular column so they survive rbind()
  df$rn <- rownames(df)
  return(df)
}
new_df <- rbind(add_row_names_to_df(df1),
                add_row_names_to_df(df2),
                add_row_names_to_df(df3))
# one row per unique row name; column 1 holds the name
avg_df <- as.data.frame(matrix(unique(new_df$rn),
                               nrow = length(unique(new_df$rn)),
                               ncol = 3))
for (i in seq_len(nrow(avg_df))) {
  avg_df[i, ] <- c(avg_df[i, 1],
                   mean(new_df$A[new_df$rn == avg_df[i, 1]]),
                   mean(new_df$B[new_df$rn == avg_df[i, 1]]))
}
colnames(avg_df) <- c("rowname", "avgA", "avgB")
avg_df
results in:
rowname avgA avgB
1 alpha 1 5
2 beta 2 4
3 gamma 3.66666666666667 2.33333333333333
4 delta 3 4
5 zeta 5 2
6 theta 6 2
7 epsilon 7 1
For the occurrence matrix you can use the table() function from base R:
as.matrix(table(c(rownames(df1),rownames(df2),rownames(df3))))
yields:
[,1]
alpha 3
beta 3
delta 2
epsilon 1
gamma 3
theta 1
zeta 2
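As a more compact base R alternative to the loop above, the averaging can be sketched with aggregate(). This is only one possible approach; the names stacked, avg and occ are made up for illustration, and the example data are rebuilt here so the block runs on its own.

```r
df1 <- data.frame(A = 1:3, B = 3:1,
                  row.names = c("alpha", "beta", "gamma"))
df2 <- data.frame(A = 1:5, B = 5:1,
                  row.names = c("alpha", "beta", "delta", "gamma", "zeta"))
df3 <- data.frame(A = 1:7, B = 7:1,
                  row.names = c("alpha", "beta", "delta", "gamma",
                                "zeta", "theta", "epsilon"))

# Stack the data frames, keeping each row name in a regular column "rn"
stacked <- do.call(rbind, lapply(list(df1, df2, df3), function(d) {
  data.frame(rn = rownames(d), d, row.names = NULL)
}))

# Mean of A and B per row name
avg <- aggregate(cbind(A, B) ~ rn, data = stacked, FUN = mean)

# Number of data frames in which each row name occurs
occ <- table(stacked$rn)

avg
occ
```

Both results come back sorted alphabetically by row name rather than in order of first appearance.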
What is the best way to convert a specific column in each list object to a specific format?
For instance, I have a list with four objects (each of which is a data frame) and I want to change column 3 in each data.frame from double to integer?
I'm guessing something along the lines of lapply but I didn't know what specific syntax to use. I was trying:
lapply(df,function(x){as.numeric(var1(x))})
but it wasn't working.
Thanks!
Yes, lapply works well here:
lapply(listofdfs, function(df) {       # loop through each data.frame in list
  df[ , 3] <- as.integer(df[ , 3])     # make the 3rd column of type integer
  df                                   # return the new data.frame
})
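A self-contained sketch of the same idea, with a check of the resulting column types. The list dfs and the name converted are made up for illustration. It uses [[3]] instead of [ , 3] on the assumption that the list might also hold tibbles, where [ , 3] returns a one-column tibble rather than a vector.

```r
# Hypothetical input: a list of data frames whose 3rd column is double
dfs <- list(
  data.frame(a = 1:2, b = c("x", "y"), v = c(1, 2)),
  data.frame(a = 3:4, b = c("z", "w"), v = c(3, 4))
)

converted <- lapply(dfs, function(d) {
  d[[3]] <- as.integer(d[[3]])  # convert the 3rd column in this copy
  d                             # return the modified data frame
})

# Inspect the class of the 3rd column in every list element
vapply(converted, function(d) class(d[[3]]), character(1))
```

lapply() returns a new list, so the result has to be assigned; the original dfs is left unchanged.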
This is just an alternative to C. Braun's answer.
You can also use the map() function from the purrr package.
Input:
library(tidyverse)
df <- tibble(a = c(1, 2, 3), b =c(4, 5, 6), d = c(7, 8, 9))
myList <- list(df, df, df)
myList
Method:
map(myList, ~ mutate_at(.x, vars(3), as.integer))  # funs() is deprecated; pass the function directly
Output:
[[1]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[2]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[3]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
You can use this:
dlist2 <- lapply(dlist, function(x) {
  y <- x
  y[, coltochange] <- as.numeric(x[, coltochange])
  return(y)
})
Simple example:
data <- data.frame(cbind(c("1","2","3","4",NA),c(1:5)),stringsAsFactors = F)
typeof(data[,1]) #character
dlist <- list(data,data,data)
coltochange <- 1
dlist2 <- lapply(dlist, function(x) {
  y <- x
  y[, coltochange] <- as.numeric(x[, coltochange])
  return(y)
})
typeof(dlist[[1]][,1]) #character
typeof(dlist2[[1]][,1]) #double
Why doesn't this work?
df <- data.frame(x=1:2, y = 3:4, z = 5:6)
df[] <- df[c("z", "y", "x")]
df
#> x y z
#> 1 5 3 1
#> 2 6 4 2
Notice that the names are in the original order, but the data itself has changed order.
This works just fine
df <- data.frame(x=1:2, y = 3:4, z = 5:6)
df[c("z", "y", "x")]
#> z y x
#> 1 5 3 1
#> 2 6 4 2
When you assign into an extraction (x[i] <- value), the values at the indexed positions are replaced, not the names. For example, replacing the first item below does not affect the name of the element:
x <- c(a=1, b=2)
x[1] <- 3
x
a b
3 2
In your data frame you replaced the values in the same way: df[] <- ... overwrites the columns by position, so the values changed but the names stayed the same. To reorder the data frame, assign the reordered result to df instead of assigning into df[]:
df <- df[c("z", "y", "x")]
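The difference between the two forms can be checked side by side. This is a small sketch; the names kept and reordered are made up for illustration.

```r
df <- data.frame(x = 1:2, y = 3:4, z = 5:6)

# Assigning into df[] fills the existing columns by position:
# names stay x, y, z, but column x now holds the old z values
kept <- df
kept[] <- df[c("z", "y", "x")]

# Plain assignment replaces the whole object, names included
reordered <- df[c("z", "y", "x")]

names(kept)
names(reordered)
```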
Just don't put the [] after the df and it will do as you want...
df <- data.frame(x=1:2, y = 3:4, z = 5:6)
df <- df[c("z", "y", "x")]
df
# z y x
#1 5 3 1
#2 6 4 2
And if your question is about why, Pierre Lafortune's comment is right.
As a side note, I also like to add the comma to separate the dimensions:
df <- df[,c("z", "y", "x")]
I find it more proper.
There are a lot of posts about replacing NA values. I am aware that one could replace NAs in the following table/frame with the following:
x[is.na(x)]<-0
But what if I want to restrict it to only certain columns? Let me show you an example.
First, let's start with a dataset.
set.seed(1234)
x <- data.frame(a=sample(c(1,2,NA), 10, replace=T),
                b=sample(c(1,2,NA), 10, replace=T),
                c=sample(c(1:5,NA), 10, replace=T))
Which gives:
a b c
1 1 NA 2
2 2 2 2
3 2 1 1
4 2 NA 1
5 NA 1 2
6 2 NA 5
7 1 1 4
8 1 1 NA
9 2 1 5
10 2 1 1
Ok, so I only want to restrict the replacement to columns 'a' and 'b'. My attempt was:
x[is.na(x), 1:2]<-0
and:
x[is.na(x[1:2])]<-0
Which does not work.
My data.table attempt, where y<-data.table(x), was obviously never going to work:
y[is.na(y[,list(a,b)]), ]
I want to pass columns inside the is.na argument but that obviously wouldn't work.
I would like to do this in a data.frame and a data.table. My end goal is to recode the 1:2 to 0:1 in 'a' and 'b' while keeping 'c' the way it is, since it is not a logical variable. I have a bunch of columns so I don't want to do it one by one. And, I'd just like to know how to do this.
Do you have any suggestions?
You can do:
x[, 1:2][is.na(x[, 1:2])] <- 0
or better (IMHO), use the variable names:
x[c("a", "b")][is.na(x[c("a", "b")])] <- 0
In both cases, 1:2 or c("a", "b") can be replaced by a pre-defined vector.
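A small self-contained check of the name-based variant with a pre-defined vector. The data and the name cols are made up for illustration.

```r
# Hypothetical data with NAs in every column
x <- data.frame(a = c(1, NA, 2),
                b = c(NA, 2, 1),
                c = c(NA, 4, 5))

cols <- c("a", "b")  # columns in which NA should become 0

# is.na() on the sub-data-frame gives a logical matrix that indexes
# exactly the NA cells of the selected columns
x[cols][is.na(x[cols])] <- 0

x
```

Column c is not part of cols, so its NA is left untouched.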
Building on @Robert McDonald's tidyr::replace_na() answer, here are some dplyr options for controlling which columns the NAs are replaced in:
library(tidyverse)
# by column type:
x %>%
  mutate_if(is.numeric, ~replace_na(., 0))
# select columns defined in vars(col1, col2, ...):
x %>%
  mutate_at(vars(a, b, c), ~replace_na(., 0))
# all columns:
x %>%
  mutate_all(~replace_na(., 0))
Edit 2020-06-15
Since data.table 1.12.4 (Oct 2019), data.table gains two functions to facilitate this: nafill and setnafill.
nafill operates on columns:
cols = c('a', 'b')
y[ , (cols) := lapply(.SD, nafill, fill=0), .SDcols = cols]
setnafill operates on tables (the replacements happen by-reference/in-place)
setnafill(y, cols=cols, fill=0)
# print y to show the effect
y[]
This will also be more efficient than the other options; see ?nafill for more, including the last-observation-carried-forward (LOCF) and next-observation-carried-backward (NOCB) versions of NA imputation for time series.
This will work for your data.table version:
for (col in c("a", "b")) y[is.na(get(col)), (col) := 0]
Alternatively, as David Arenburg points out below, you can use set (side benefit - you can use it either on data.frame or data.table):
for (col in 1:2) set(x, which(is.na(x[[col]])), col, 0)
This is now trivial in tidyr with replace_na(). The function appears to work for data.tables as well as data.frames:
tidyr::replace_na(x, list(a=0, b=0))
Not sure if this is more concise, but this function will also find and allow replacement of NAs (or any value you like) in selected columns of a data.table:
update.mat <- function(dt, cols, criteria) {
  require(data.table)
  x <- as.data.frame(which(criteria==TRUE, arr.ind = TRUE))
  y <- as.matrix(subset(x, x$col %in% which((names(dt) %in% cols), arr.ind = TRUE)))
  y
}
To apply it:
y[update.mat(y, c("a", "b"), is.na(y))] <- 0
The function creates a matrix of the selected columns and rows (cell coordinates) that meet the input criteria (in this case is.na == TRUE).
We can solve it the data.table way with the tidyr::replace_na function and lapply:
library(data.table)
library(tidyr)
setDT(df)
df[,c("a","b","c"):=lapply(.SD,function(x) replace_na(x,0)),.SDcols=c("a","b","c")]
The same approach also helps when pasting columns that contain NA: first apply replace_na(x, ""), then combine the columns with stringr::str_c.
Starting from the data.table y, you can just write:
y[, (cols):=lapply(.SD, function(i){i[is.na(i)] <- 0; i}), .SDcols = cols]
Don't forget to library(data.table) before creating y and running this command.
This needed a bit extra for dealing with NA's in factors.
Found a useful function here, which you can then use with mutate_at or mutate_if:
replace_factor_na <- function(x){
  x <- as.character(x)
  x <- if_else(is.na(x), 'NONE', x)
  as.factor(x)  # return the factor visibly instead of ending on an assignment
}
df <- df %>%
  mutate_at(
    vars(vector_of_column_names),
    replace_factor_na
  )
Or apply to all factor columns:
df <- df %>%
  mutate_if(is.factor, replace_factor_na)
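For readers who would rather avoid the dplyr dependency, the same idea can be sketched in base R. The function name replace_factor_na_base, its label argument and the example data are made up for illustration; this is a sketch, not the function referenced above.

```r
replace_factor_na_base <- function(f, label = "NONE") {
  stopifnot(is.factor(f))
  # the new label must be a level before it can be assigned
  levels(f) <- union(levels(f), label)
  f[is.na(f)] <- label
  f
}

# Hypothetical data: one factor column and one numeric column
df <- data.frame(g = factor(c("a", NA, "b")),
                 n = c(1, 2, NA))

# Apply only to the factor columns
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], replace_factor_na_base)

df
```

Unlike the as.character() round trip, this keeps the existing level order and appends the new level at the end.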
For a specific column, there is an alternative with sapply
DF <- data.frame(A = letters[1:5],
B = letters[6:10],
C = c(2, 5, NA, 8, NA))
DF_NEW <- sapply(seq(1, nrow(DF)),
                 function(i) ifelse(is.na(DF[i,3]) == TRUE,
                                    0,
                                    DF[i,3]))
DF[,3] <- DF_NEW
DF
For completeness, building on @sbha's answer, here is the tidyverse version with the across() function, available in dplyr since version 1.0 (it supersedes the *_at() variants, among others):
# random data
set.seed(1234)
x <- data.frame(a = sample(c(1, 2, NA), 10, replace = T),
                b = sample(c(1, 2, NA), 10, replace = T),
                c = sample(c(1:5, NA), 10, replace = T))
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
# with the magrittr pipe
x %>% mutate(across(1:2, ~ replace_na(.x, 0)))
#> a b c
#> 1 2 2 5
#> 2 2 2 2
#> 3 1 0 5
#> 4 0 2 2
#> 5 1 2 NA
#> 6 1 2 3
#> 7 2 2 4
#> 8 2 1 4
#> 9 0 0 3
#> 10 2 0 1
# with the native pipe (since R 4.1)
x |> mutate(across(1:2, ~ replace_na(.x, 0)))
#> a b c
#> 1 2 2 5
#> 2 2 2 2
#> 3 1 0 5
#> 4 0 2 2
#> 5 1 2 NA
#> 6 1 2 3
#> 7 2 2 4
#> 8 2 1 4
#> 9 0 0 3
#> 10 2 0 1
Created on 2021-12-08 by the reprex package (v2.0.1)
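This is quite handy with data.table and stringr (x must be a data.table here). Note that str_replace_na() returns character vectors, so every column becomes character:
library(data.table)
library(stringr)
x[, lapply(.SD, function(xx) {str_replace_na(xx, "0")})]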
FYI
this works fine for me
DataTable DT = new DataTable();
DT = DT.AsEnumerable().Select(R =>
{
R["Campo1"] = valor;
return (R);
}).ToArray().CopyToDataTable();