Looping over columns in data.table in R

I am trying to loop over columns of a data.table in R. I am having trouble getting the for loop to use the column correctly when I subset the data.table.
The goal is to count, for each column of interest, the number of rows in the data.table where that column equals 1.
Here is my code:
data <- data.table(va=c(1,0,1), vb=c(1,0,0), vc=c(1,1,1))
names <- c("va", "vc")
for (col in names) {
  print(nrow(data[col == 1,]))
  print(col)
}
Here is the output I get
[1] 0
[1] "va"
[1] 0
[1] "vc"
Is there something I am missing or a better way of doing this?

You can use colSums on a logical matrix (.SD == 1), which is much simpler and faster than looping.
dt <- data.table(va=c(1,0,1), vb=c(1,0,0), vc=c(1,1,1))
col.names <- c("va", "vc")
dt[, colSums(.SD==1), .SDcols = col.names]
# va vc
# 2 3
Note: I changed your object names to dt and col.names because it is not good practice to mask base functions (data, names) with your own objects.
Your loop prints 0 because col is just a character string, so data[col == 1, ] compares the string "va" to 1 (which is FALSE) instead of looking up the column. If you really want to use a for loop (I don't recommend it, but for educational purposes...), you can fix it with get, which retrieves the values of the column rather than using the column name itself:
for (col in col.names) {
  dt[get(col) == 1, print(.N)]
}
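If you prefer to avoid an explicit loop but still want one count per column name, here is a small alternative sketch (my addition, reusing dt and col.names from above):
sapply(col.names, function(cn) dt[get(cn) == 1, .N])
# va vc
#  2  3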

Related

Difference between two columns with variables separated by ; in R

I am a beginner in R and while trying to do some exercises I got stuck on one of them. My data.frame is as follows:
LanguageWorkedNow    LanguageNextYear
Java; PHP            Java; C++; SQL
C;C++;JavaScript;    JavaScript; C; SQL
I need to know which values are in LanguageNextYear but not in LanguageWorkedNow, so I can build a list of the new ones.
Sorry if the question is a duplicate; I'm quite new here and tried to find it, but with no success.
base R
Idea: mapply setdiff over the strsplit results of NextYear and WorkedNow, then paste each result back together using collapse=";":
df$New <- with(df, {
  a <- mapply(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";"), SIMPLIFY = FALSE)
  sapply(a, paste, collapse=";")
})
# SIMPLIFY = FALSE is needed in a general case, it doesn't
# affect the output in the example case
# Or if you use Map instead of mapply, that is the default, so
# it could also be...
df$New <- with(df,
  sapply(Map(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";")),
         paste, collapse=";"))
data
df <- read.table(text = "WorkedNow NextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=TRUE, stringsAsFactors=FALSE)
Here's a solution using the purrr package:
df = read.table(text = "
LanguageWorkedNow LanguageNextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=T, stringsAsFactors=F)
library(purrr)
df$New = map2_chr(df$LanguageWorkedNow,
                  df$LanguageNextYear,
                  ~{x1 = unlist(strsplit(.x, split=";"))
                    x2 = unlist(strsplit(.y, split=";"))
                    paste0(x2[!x2 %in% x1], collapse = ";")})
df
# LanguageWorkedNow LanguageNextYear New
# 1 Java;PHP Java;C++;SQL C++;SQL
# 2 C;C++;JavaScript JavaScript;C;SQL SQL
For each row, the two columns are split on ";" into vectors of values. You then check which values of the NextYear vector don't exist in the WorkedNow vector and combine those values back into a single string.
The map function family helps you apply this logic to each row. Here we use map2_chr because there are two inputs (your two columns) and we expect a character output.

R: transforming columns, calling them by $name, using a loop

I imported a tibble from a text file. Many numeric columns are imported as "chr", I guess because they contain a "," instead of a "." as the decimal separator.
My goal is to write a loop which runs through the names of the desired columns, replaces "," with "." and converts the columns to numeric.
A little example:
data <- data.frame("A1" =c("2,1","2,1","2,1"), "A2" =c("1,3","1,3","1,3"),
stringsAsFactors = F) %>% as.tibble() #example data
colname <- c("A1", "A2") #creating variable for loop
for(i in colname) {
  nam <- paste0("data$", i)
  assign(nam, as.numeric(gsub(",", ".", eval(parse(text = paste0("data$", i))))))
}
Instead of overwriting the existing column, R creates a new variable:
data$A1 # that's the existing column as part of the tibble
[1] "2,1" "2,1" "2,1"
`data$A1` # that's just a new variable named "data$A1"; mind the backticks
[1] 2.1 2.1 2.1
I also tried to assign (<-) the new numeric values via eval, but that does not work either.
eval(parse(text = paste0("data$", i))) <- as.numeric(
gsub(",",".", eval(parse(text = paste0("data$",i)))))
Error: target of assignment expands to non-language object
Any suggestions on how to do this transformation? I have the same issue with other columns that I want to aggregate into a new variable, which should also be part of the existing tibble. I could do it by hand, but that would take lots of time and probably produce many mistakes.
Thanks a lot!
Sam
As you are already working with the tidyverse, you can use dplyr::mutate_at and the colname variable you have already defined.
data %>%
  mutate_at(.vars = colname,
            .funs = function(x) { as.numeric(gsub(",", ".", x)) })
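In dplyr 1.0 and later, mutate_at() is superseded by across(). An equivalent sketch (my addition), reusing the colname vector from the question:
library(dplyr)
data <- data %>%
  mutate(across(all_of(colname), ~ as.numeric(gsub(",", ".", .x))))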

How to build subset query using a loop in R?

I'm trying to subset a big table across a number of columns, keeping only the rows where State_2009, State_2010, State_2011, etc. do not equal the value "Unknown".
My instinct was to do something like this (coming from a JS background), where I either build the query in a loop or continually subset the data in a loop, referencing the year as a variable.
mysubset <- data
for(i in 2009:2016){
  mysubset <- subset(mysubset, paste("State_", i, " != Unknown", sep=""))
}
But this doesn't work, at least because paste returns a string, giving me the error 'subset' must be logical.
Is there a better way to do this?
Using dplyr with the filter_ function should get you the correct output
library(dplyr)
mysubset <- data
for (i in 2009:2016) {
  mysubset <- mysubset %>%
    filter_(paste("State_", i, " != \"Unknown\"", sep = ""))
}
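Note that filter_() is deprecated in current dplyr. A loop-free sketch of the same filter using if_all() (my addition, assuming dplyr >= 1.0 and columns named State_2009 through State_2016):
library(dplyr)
mysubset <- data %>%
  filter(if_all(num_range("State_", 2009:2016), ~ .x != "Unknown"))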
To add to Matt's answer, you could also do it like this:
cols <- paste0("State_", 2009:2016)
# row indices of any "Unknown" entry in the State_ columns
inds <- which(mysubset[, cols] == "Unknown", arr.ind = TRUE)[, 1]
mysubset <- mysubset[-unique(inds), ]
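A closely related base R variant (my addition): counting the "Unknown" hits per row with rowSums avoids the edge case where no row matches, in which case -unique(inds) is an empty index and would drop every row:
keep <- rowSums(mysubset[, cols] == "Unknown") == 0
mysubset <- mysubset[keep, ]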

Applying a function (gsub) to lots of columns

I am kind of new to R and I want to apply the gsub function to columns 6 to 12 of my data frame called x. However, I am kind of stuck. I started with:
gsub("\\.", "",x[,c(6:12)])
But then it returned only a few rows.
Then, I tried to use this:
x1<-apply(x,c(6:12),function(x) gsub("\\.", "",x))
But I got the following error:
Error in if (d2 == 0L) { : missing value where TRUE/FALSE needed
Also tried this:
for (i in x[,c(6:12)])
{a<-data.frame(gsub("\\.", "",i))}
Does anybody have a tip or a solution?
It would also be great if someone showed me how to do this both with an apply function and with a for loop.
Here is another solution; it returns all the columns of the original data frame. Note that the function has to be passed as a formula-style lambda (or wrapped in funs() in older dplyr) rather than called directly:
library(dplyr)
mutate_at(x, 6:12, ~ gsub("\\.", "", .))
Here is a solution the data.table way. It is worth considering when the table is big and time is a critical factor.
library(data.table) # load library
x = as.data.table(x) # convert data.frame into data.table
cols = names(x)[6:12] # define which columns to work with
x[ , (cols) := lapply(.SD, function(x) {gsub("\\.", "", x)}), .SDcols = cols] # replace
x # enjoy results
Adding the solution using apply from the comments, as this one really helped me:
x1 <- apply(x[,6:12], 2, function(x) gsub("\\.", "",x))
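One caveat worth adding (not from the original answers): apply() coerces the data frame to a character matrix, so x1 above is a matrix rather than a data frame. To overwrite the original columns in place you could assign back instead:
x[, 6:12] <- apply(x[, 6:12], 2, function(col) gsub("\\.", "", col))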

R loop optimisation / loop is way too time-consuming

The following loop takes ages. Is there any way to do this in a more time-efficient way? The data.table consists of 27 variables and more than 600k observations.
data <- read.table("file.txt", header = T, sep= "|")
colnames(data)[c(1)] <- c("X")
data <- as.data.table(data)
n = 1
vector <- vector()
for (i in 2:nrow(data)) {
  if (data[["X"]][i] != data[["X"]][i-1]) {
    n = 1; vector[i] = 1
  } else {
    n = n + 1; vector[i] = n
  }
}
Basically, I need to index every appearance of a unique entry in X, i.e. the first time it appeared, the second time it appeared, etc., and then merge this to the existing data as an additional column. However, I got stuck compiling the vector.
Thank you.
First off, use fread:
DT <- fread("file.txt", sep = "|")
Next, use setnames:
setnames(DT, 1, "X")
Finally, use rowid:
DT[ , vector := rowid(X)]
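For illustration (my addition), rowid(X) numbers each occurrence of a value of X in order of appearance, which is exactly the counter the loop was building. On a toy table:
library(data.table)
DT <- data.table(X = c("a", "a", "b", "a", "b"))
DT[ , vector := rowid(X)]
DT
#    X vector
# 1: a      1
# 2: a      2
# 3: b      1
# 4: a      3
# 5: b      2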
