Setting NAs in multiple columns to a value in R [duplicate]

I am quite new to R and am trying to select certain columns in order to set their NAs to 0.
So far I have:
col_names1 <- c('a','b','c')
col_names2 <- c('e','f','g')
col_names <- c(col_names1, col_names2)
data = fread('data.tsv', sep="\t", header= FALSE,na.strings="NA",
stringsAsFactors=TRUE,
colClasses=my_col_Classes
)
setnames(data, col_names)
data[col_names2][is.na(data[col_names2])] <- 0
But I keep getting the error
Error in `[.data.table`(`*tmp*`, column_names2): When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.
I believe this error is saying that my data is in the wrong order, but I am not sure how to fix it.

You can do it with data.table's assignment-by-reference operator, :=, applied to the columns listed in .SDcols:
data <- data.table(a = c(2, NA, 3, 5), b = c(NA,2,3,4), c = c(2,5,NA, 6))
fix_columns <- c('a','b')
fix_fun <- function(x) ifelse(is.na(x), 0 , x)
data[,(fix_columns):=lapply(.SD, fix_fun), .SDcols=fix_columns]
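P.S. You can't select columns from a data.table with data[col_names2], because a character vector in the first argument is treated as a join on the key, which is why the error mentions setkey. To select columns by a character vector, one approach is: data[, col_names2, with = FALSE]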
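As an alternative to the lapply approach above, the same replacement can be sketched with data.table's set(), which updates one column at a time by reference. The table contents and the names dt and na_cols are made up for illustration; only columns e and f are meant to be fixed.

```r
library(data.table)

# Toy table (hypothetical values); column 'a' should keep its NA
dt <- data.table(a = c(2, NA, 3), e = c(NA, 1, 2), f = c(4, NA, NA))
na_cols <- c("e", "f")

# For each chosen column, overwrite only the rows that are NA
for (col in na_cols) {
  set(dt, i = which(is.na(dt[[col]])), j = col, value = 0)
}

print(dt)
```

Because set() modifies dt in place, no reassignment is needed, and columns outside na_cols are left untouched.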

Related

R convert NA value to character for string comparison [duplicate]

I have a dataframe that contains 2 columns with character strings. The goal is to see how many of them are identical including NA values. If both columns give NA, it should be treated as identical.
class(df$column_1) # shows "character"
length(which(df$column_1 == df$column_2)) # the result excludes the NA rows
In addition to the equality test, check whether both values are NA with is.na:
length(which(x$a == x$b | (is.na(x$a) & is.na(x$b))))
#[1] 2
Data:
x <- data.frame(a=c("a", NA, "b"), b=c("c", NA, "b"))
Another way is to use identical(), which has the nice property that identical(NA, NA) is TRUE, and compare the columns row by row in a loop:
Dummy data:
a=c("a",NA,"b")
b=c(NA,NA,"d")
df = data.frame(a, b, stringsAsFactors=FALSE)
Code:
count = 0
for (i in seq_len(nrow(df))) {
  count = count + identical(df[i, 1], df[i, 2])
}
Output:
> count
[1] 1
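The two answers above can be combined into a loop-free sketch in base R. It uses the same dummy vectors as the loop example; the names same and n_same are illustrative.

```r
a <- c("a", NA, "b")
b <- c(NA, NA, "d")

# TRUE where both values are equal, or where both are NA
same <- (a == b) | (is.na(a) & is.na(b))

# Rows where exactly one side is NA compare as NA; count those as not identical
same[is.na(same)] <- FALSE

n_same <- sum(same)
print(n_same)  # 1
```

Only the second row matches here, because both values are NA.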

How can I remove the commas in a numeric value in R? [duplicate]

column19 <- 19
mdf[,column19] <- lapply(mdf[,column19],function(x){as.numeric(gsub(",", "", x))})
This snippet works but also results in duplicate values.
If there is only a single column, we don't need lapply:
mdf[, column19] <- as.numeric(gsub(",", "", mdf[, column19], fixed = TRUE))
The OP's code didn't work as expected because mdf[, column19] drops the single column to a vector, so lapply loops over each element of that vector and returns a list with one entry per row. We are then assigning that list back to a single column:
column19 <- 19
mdf[,column19] <- lapply(mdf[,column19],function(x) as.numeric(gsub(",", "", x)))
Warning message:
In `[<-.data.frame`(`*tmp*`, , column19, value = list(27, 49, 510, :
provided 5 variables to replace 1 variables
Instead, if we want to use the same procedure, keep the data.frame structure with either mdf[column19] or mdf[, column19, drop = FALSE] and then loop using lapply. That way, lapply returns a list containing a single vector:
mdf[column19] <- lapply(mdf[column19],function(x) as.numeric(gsub(",", "", x)))
The underlying cause is the dropping of dimensions: when [ selects a single column or row, drop = TRUE is the default.
data
set.seed(24)
mdf <- as.data.frame(matrix(sample(paste(1:5, 6:10, sep=","),
5*20, replace = TRUE), 5, 20), stringsAsFactors=FALSE)
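The dimension-dropping behaviour described above can be checked directly. This is a small sketch on a made-up two-column data frame (the values and the name cleaned are illustrative), followed by one way to clean every column at once while keeping the data.frame structure.

```r
mdf <- data.frame(v1 = c("1,6", "2,7"), v2 = c("3,8", "4,9"),
                  stringsAsFactors = FALSE)

as_vector <- mdf[, 1]                # drop = TRUE by default: a character vector
as_frame  <- mdf[, 1, drop = FALSE]  # stays a one-column data.frame

is.data.frame(as_vector)  # FALSE
is.data.frame(as_frame)   # TRUE

# Assigning into cleaned[] keeps the data.frame and replaces each column
cleaned <- mdf
cleaned[] <- lapply(mdf, function(x) as.numeric(gsub(",", "", x, fixed = TRUE)))
str(cleaned)
```

Here lapply iterates over columns, not over individual elements, so each column is replaced by one numeric vector of the same length.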

Subsetting columns in a data.table [duplicate]

Example R data frame:
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b)
I can easily subset columns in a data frame like this:
df.smaller <- df[c(1,2)]
  n  s
1 2 aa
2 3 bb
3 5 cc
Very handy!
However, with data.table (which I thought would make this easier) I have not found an equally quick way to do the same. How can I do this quickly and easily with a data.table?
dt = data.table(df)
dt.smaller <- dt[c(1,2)]
   n  s     b
1: 2 aa  TRUE
2: 3 bb FALSE
This returns the first two rows instead. It is probably just a comma or something similar that I have to change, but I can't figure it out.
We need to use with = FALSE:
dt[, 1:2, with = FALSE]
This is explained in ?data.table:
with: By default with=TRUE and j is evaluated within the frame of x;
column names can be used as variables.
When with=FALSE j is a character vector of column names, a numeric
vector of column positions to select or of the form startcol:endcol,
and the value returned is always a data.table. with=FALSE is often
useful in data.table to select columns dynamically
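For comparison, here is a short sketch of three ways to select the same columns from a data.table. It rebuilds the example table from the question; the names cols, by_name, by_sdcols and by_list are illustrative.

```r
library(data.table)

dt <- data.table(n = c(2, 3, 5),
                 s = c("aa", "bb", "cc"),
                 b = c(TRUE, FALSE, TRUE))
cols <- c("n", "s")

# 1. Character vector held in a variable: needs with = FALSE
by_name <- dt[, cols, with = FALSE]

# 2. .SD restricted to the chosen columns via .SDcols
by_sdcols <- dt[, .SD, .SDcols = cols]

# 3. Unquoted column names wrapped in .()
by_list <- dt[, .(n, s)]

print(by_name)
```

All three return a data.table with columns n and s. The first two are convenient when the column names are computed at run time; the third is the usual form when the names are known in advance.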

Mapply to Add Column to Each Dataframe in a List

I implemented some code from a previous question:
Lapply to Add Columns to Each Dataframe in a List
Using the method above, I receive corrupt data. While I cannot provide actual data, I am wondering if additional arguments need to be implemented to prevent shuffling.
Basically, this:
library(data.table)
df1 <- data.frame(x = runif(3), y = runif(3))
df2 <- data.frame(x = runif(3), y = runif(3))
dfs <- list(df1, df2)
years <- list(2013, 2014)
a<-Map(cbind, dfs, year = years)
final<-rbindlist(a)
But when this is applied to a list of thousands of data frames, the results are incorrect. Assume that some data frames, say a df1.5 somewhere between the two data frames above, are empty. Would that affect the order in which Map binds the years to the dfs? Essentially, my output has some data assigned to different years than the ones Map should have attached. I checked the length and order of the years list and compared them with the year column in final. They are identical. Any thoughts?
We create a logical index based on the length of each element in 'dfs', use it to subset both 'dfs' and 'years', and then do the cbind with Map:
i1 <- sapply(dfs, length)>1
Or, to make it more stringent:
i1 <- sapply(dfs, function(x) is.data.frame(x) && length(x) > 0)
a <- Map(cbind, dfs[i1], year = years[i1])
and then do the rbindlist with fill = TRUE in case the number of columns is not the same in all the data.frames in the list.
rbindlist(a, fill = TRUE)
data
dfs[[3]] <- list(NULL)
dfs[[4]] <- data.frame()
years <- 2013:2016
Use the idcol argument to rbindlist and add the year column afterwards:
res = rbindlist(dfs, idcol=TRUE)
res[.(.id = 1:2, year = 2013:2014), on=".id", year := i.year]
X[i, on=cols, z := i.z] joins X with i on cols and then copies z from i into the matching rows of X.
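The filtering idea from the first answer can be sketched in base R alone. The toy list below is made up, with an empty data frame in the middle; the names keep, tagged and final are illustrative, and the filter assumes that "empty" means zero rows.

```r
dfs <- list(data.frame(x = 1:2), data.frame(), data.frame(x = 3:4))
years <- c(2013, 2014, 2015)

# Keep only real data frames that have at least one row
keep <- vapply(dfs, function(d) is.data.frame(d) && nrow(d) > 0, logical(1))

# Subset both inputs with the same index so each year stays with its data frame
tagged <- Map(cbind, dfs[keep], year = years[keep])
final <- do.call(rbind, tagged)
print(final)
```

The key point is that dfs and years are filtered with the same logical vector, so dropping the empty element cannot shift the remaining years onto the wrong data frames.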

Rearrange dataframe by subsetting and column bind [duplicate]

I have the following dataframe:
st <- data.frame(
se = rep(1:2, 5),
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2))
st$xy <- paste(st$X,",",st$Y)
st <- st[c("se","xy")]
but I want it to be the following (5 columns, one row per value of se):
1 | 2 | 3 | 4 | 5
-1.53697673029089 , 2.10652020463275 | -1.02183940974772 , 0.623009466458354 | 1.33614674072657 , 1.5694345481646 | 0.270466789820086 , -0.75670874554064 | -0.280167896821629 , -1.33313822867893
0.26012874418111 , 2.87972571647846 | -1.32317949800031 , -2.92675188421021 | 0.584199000313255 , 0.565499464846637 | -0.555881716346136 , -1.14460518414649 | -1.0871665543915 , -3.18687136890236
That is, rows with the same value of se should be bound together as columns of a single row.
Do you have any ideas how to accomplish this?
I had no luck with spread from tidyr, and I guess the solution involves sapply, cbind and an if statement. Speed matters because the real data has more than 35,000 rows.
It seems as though your eventual goal is to have a data file which has roughly 35000 columns. Are you sure about that? That doesn't sound very tidy.
To do what you want, you need a row identifier. In the code below, I've called it caseid and removed it once it was no longer required. I then transpose the result to get what you asked for.
library(tidyr)
library(dplyr)
st <- data.frame(
se = rep(1:2, 5),
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2))
st$xy <- paste(st$X,",",st$Y)
st <- st[c("se","xy")]
st$caseid = rep(1:(nrow(st)/2), each = 2) # temporary
df = spread(st, se, xy) %>% select(-caseid) %>% t()
print(df)
If we need to split the 'xy' column elements into individual units, cSplit from splitstackshape can be used. Then rbind the alternating rows of 'st1' after unlisting.
library(splitstackshape)
st1 <- cSplit(st, 'xy', ', ', 'wide')
rbind(unlist(st1[c(TRUE,FALSE)][,-1, with=FALSE]),
unlist(st1[c(FALSE, TRUE)][,-1, with=FALSE]))
If we don't need to split the 'xy' column into individual elements, we can use dcast from data.table. It should be fast enough. Convert the 'data.frame' to a 'data.table' with setDT(st), create a sequence column ('N') by 'se', and then dcast from 'long' to 'wide'.
library(data.table)
dcast(setDT(st)[, N:= 1:.N, se], se~N, value.var= 'xy')
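If no extra packages are wanted, the same long-to-wide step can be sketched in base R with split and rbind. The placeholder strings p1 to p10 stand in for the pasted "X , Y" values, and the sketch assumes every value of se occurs the same number of times.

```r
st <- data.frame(se = rep(1:2, 5),
                 xy = paste0("p", 1:10),
                 stringsAsFactors = FALSE)

# One character vector per value of se, stacked as the rows of a matrix
wide <- do.call(rbind, split(st$xy, st$se))
print(wide)
```

The result is a character matrix with one row per value of se and one column per occurrence, in the original row order within each group.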
