I wish to gives values in a vector names. I know how to do that but in this case I have many names and many values, both within vectors within lists, and typing them by hand would by suicide.
This method:
> values <- c('jessica' = 1, 'jones' = 2)
> values
jessica jones
1 2
obviously works. However, this method:
> names <- c('jessica', 'jones')
> values <- c(names[1] = 1, names[2] = 2)
Error: unexpected '=' in "values <- c(names[1] ="
Well... I cannot understand why R refuses to read these as pure characters to assign them as names.
I realize I can create values and names separately and then assign names as names(values) but again, my actual case is far more complex. But really I would just like to know why this particular issue occurs.
EDIT I: The ACTUAL data I have is a list of vectors, each is a different combination of amounts of ingredients, and then a giant vector of ingredient names. I cannot just set the name vector as names, because the individual names need to be placed by hand.
EDIT II: Example of my data structure.
ingredients <- c('ing1', 'ing2', 'ing3', 'ing4') # this vector is much longer in reality
amounts <- list(c('ing1' = 1, 'ing2' = 2, 'ing4' = 3),
c('ing2' = 2, 'ing3' = 3),
c('ing1' = 12, 'ing2' = 4, 'ing3' = 3),
c('ing1' = 1, 'ing2' = 1, 'ing3' = 2, 'ing4' = 5))
# this list too is much longer
I could type each numeric value's name individually as presented, but there are many more, and so I tried instead to input the likes of:
c(ingredients[1] = 1, ingredients[2] = 2, ingredients[4] = 3)
But this throws an error:
Error: unexpected '=' in "amounts <- list(c(ingredients[1] ="
We can use setNames
setNames(1:2, names)
Another option is deframe if we have a two column dataset
library(tibble)
tibble(names, val = 1:2) %>%
deframe
Related
I have found quite a few threads on each part of the code snippet I am trying to create/use.. but not in the way(s) I am trying to do it.
I have a dataframe of customer information.
1 column is a customer ID (CID), the 2nd column is the customer specific identifier (CSI)
That means customer a single customer id can represent many specific customers from a bigger pool, and the CSI tells me which specific customer from that pool I am looking at.
Data would look like this:
data.frame("CID"=c("1","2","3","4","1","2","3","4"),
"Customer_Pool"=c("Art_Supplies", "Automotive_Supplies", "Office_Supplies", "School_Supplies",
"Art_Supplies", "Automotive_Supplies", "Office_Supplies", "School_Supplies"),
"CSI"=c("01","01","01","01","02","02","02","02"),
"Customer_name"=c("Janet","Jane", "Jill", "Jenna", "Joe", "Jim", "Jack", "Jimmy"))
I am trying to combine the CID and CSI numbers.. the problem is I need all the CID to be double digit (01 instead of 1 for example) to match the CID from 10-99
Here is what I have been trying:
DF <- DF %>% mutate(CID = if_else(str_length(CID = 1),
str_pad(CID, width = 2, side = "left), CID))
The error I am getting says: error in str_length(CID = 1): unused argument (CID = 1)
How would I correct this?
You have some syntax issues here. Try
DF <- DF %>% mutate(CID = if_else(str_length(CID) == 1,
str_pad(CID, width = 2, side = "left", pad="0"), CID))
When you call str_length(CID = 1), it looks like you are passing a parameter named "CID" to str_length which it knows nothing about. Rather, you wan to take the string length of CID and then compare that to 1 with == to test for equality (not = which is for parameter names and assignments).
But really the if_else isn't necessary here. If everyhing has to be 2 digits, then just do
DF <- DF %>% mutate(CID = str_pad(CID, width = 2, side = "left", pad="0"))
str_pad will only pad when needed.
Base R solution:
df$p_key <- with(df, paste(ifelse(nchar(CID) == 1, paste0("0", CID), CID), CSI, sep = "-"))
Tidyverse using Mr Flick's clean solution:
library(tidyverse)
df %>%
mutate(p_key = str_c(str_pad(CID, width = 2, side = "left", , pad = "0"), CSI, sep = "-"))
I have a dataset where I am trying to, by row, check about 25 columns to see if they contain a value from a list. I am not having a problem referencing the list of values to search for, but I am having trouble searching multiple columns at once. I initially thought to create a list of columns to reference, but that doesn't see to be working because you can't use a list.
Right now, I am checking each column individually for a set of values, but I was hoping to do this with less code because I will want to reference this set of columns more than once while cleaning these data. This is what I am currently using:
Dx.Elem<-list(c("DX1", "DX2", "DX3", "DX4", "DX5", "DX6", "DX7", "DX8", "DX9", "DX10", "DX11", "DX12", "DX13", "DX14", "DX15", "DX16", "DX17", "DX18",
"DX19", "DX20", "DX21", "DX22", "DX23", "DX24", "DX25"))
Dx.Panc9<-list("86384", "86394", "86382", "86392", "86381", "86391", "86383", "86393")
mydata2$Panc9<-0
mydata2$Panc9[mydata2$DX1 %in% Dx.Panc9]<-1
mydata2$Panc9[mydata2$DX2 %in% Dx.Panc9]<-1
mydata2$Panc9[mydata2$DX3 %in% Dx.Panc9]<-1
mydata2$Panc9[mydata2$DX4 %in% Dx.Panc9]<-1
The assignment of 1s actually goes to referencing mydata2$DX25, I just cut it off here to spare redundancy.
I have tried substituting referencing a list, but that doesn't work because it can't use a list.
mydata2$Panc9[mydata2[, Dx.Elem] %in% Dx.Panc9]<-1
and I get this error
Error in .subset(x, j) : invalid subscript type 'list'
Is there a way to use a list to achieve what I am trying to achieve?
Thank you for any help.
For your specific case:
lapply(mydata2[Dx.Elem], `%in%`, Dx.Panc9)
With some example data:
# create example data
set.seed(1234)
df <- data.frame(
x1 = round(runif(100, 1, 10)),
x2 = round(runif(100, 1, 10)),
x3 = round(runif(100, 1, 10)),
x4 = round(runif(100, 1, 10)),
x5 = round(runif(100, 1, 10))
)
# vector of numbers to search for (like Dx.Panc9)
numcheck <- c(2, 4)
# columns of data.frame in which to search (like Dx.Elem)
mycols <- c("x2", "x3", "x4", "x5")
# perform the check
result_list <- lapply(df[mycols], `%in%`, numcheck)
This returns a list where each element is a vector of length nrow(df). If your question is whether any column contains any the desired numbers, you can do something like this:
result_df <- data.frame(result_list)
rowSums(result_df) > 0
I've been working on a process to create all possible combinations of unique integers for lengths 1:n. I found the nCr function (combn function in the combinat package to be useful here).
Once all unique occurrences are iterated, they are appended to a consolidation table that contains any possible length+combination of the digits 1:n. A subset of the final table's relevant column (one record) looks like this (column is named String and the subset table f1):
c(1,3,4,5,9,10)
I need to select these columns from a secondary data source (df) one at a time (I am going to loop through this table), so my logic was to use this code:
df[,f1$String]
However, I get a message that says that undefined columns are selected, but if I copy and paste the contents of the cell such as:
df[,c(1, 3, 4, 5, 9, 10)]
it works fine ... I've tried all I can think of at this point; if anyone has some insight it would be greatly appreciated.
Code to reproduce is:
library(combinat)
library(data.table)
library(plyr)
rm(list=ls())
NCols=10
NRows=10
myMat<-matrix(runif(NCols*NRows), ncol=NCols)
XVars <- as.data.frame(myMat)
colnames(XVars) <- c("a","b","c","d","e","f","g","h","i","j")
x1 <- as.data.frame(colnames(XVars[1:ncol(XVars)]))
colnames(x1) <- "Independent.Variable"
setDT(x1)[, Index := .GRP, by = "Independent.Variable"]
colClasses = c("character", "numeric", "numeric")
col.names = c("String", "r!", "n!")
Combination <- read.table(text = "", colClasses = colClasses, col.names = col.names)
for(i in 1:nrow(x1)){
x2<- as.data.frame(combn(nrow(x1),i))
for (i in 1:ncol(x2)){
x3 <- paste("c(",paste(x2[1:nrow(x2),i], collapse = ", "), ")", sep="")
x3 <- as.data.frame(x3)
colnames(x3) <- "String"
x3 <- mutate(x3, "r!" = nrow(x2))
x3 <- mutate(x3, "n!" = nrow(x1))
Combination <- rbind(Combination, x3)
}
}
setDT(Combination)[, Index := .GRP, by = c("String", "r!", "n!")]
f1 <- Combination[717,]
f1$String <- as.character(f1$String)
## reference to data frame
myMat[,(f1$String)]
## pasted element
myMat[, c(1, 3, 4, 5, 9, 10)]
f1$String is the string "c(1, 3, 4, 5, 9, 10)". When you use myMat[,(f1$String)], R will look for the column with name "c(1, 3, 4, 5, 9, 10)". To get column numbers 1,3,4,5,9,10, you have to parse the string to an R expression and evaluate it first:
myMat[,eval(parse(text=f1$String))]
As #user3794498 noticed, you set f1$String as.character() so you cannot use is to get the columns you want.
You can change the way you define f1 or extract the column numbers from f1$String. Something like this should also work (load stringr before) myMat[, f1$String %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric].
Suppose I have a data frame in R that has names of students in one column and their marks in another column. These marks range from 20 to 100.
> mydata
id name marks gender
1 a1 56 female
2 a2 37 male
I want to divide the student into groups, based on the criteria of obtained marks, so that difference between marks in each group should be more than 10. I tried to use the function table, which gives the number of students in each range from say 20-30, 30-40, but I want it to pick those students that have marks in a given range and put all their information together in a group. Any help is appreciated.
I am not sure what you mean with "put all their information together in a group", but here is a way to obtain a list with dataframes split up of your original data frame where each element is a data frame of the students within a mark range of 10:
mydata <- data.frame(
id = 1:100,
name = paste0("a",1:100),
marks = sample(20:100,100,TRUE),
gender = sample(c("female","male"),100,TRUE))
split(mydata,cut(mydata$marks,seq(20,100,by=10)))
I think that #Sacha's answer should suffice for what you need to do, even if you have more than one set.
You haven't explicitly said how you want to "group" the data in your original post, and in your comment, where you've added a second dataset, you haven't explicitly said whether you plan to "merge" these first (rbind would suffice, as recommended in the comment).
So, with that, here are several options, each with different levels of detail or utility in the output. Hopefully one of them suits your needs.
First, here's some sample data.
# Two data.frames (myData1, and myData2)
set.seed(1)
myData1 <- data.frame(id = 1:20,
name = paste("a", 1:20, sep = ""),
marks = sample(20:100, 20, replace = TRUE),
gender = sample(c("F", "M"), 20, replace = TRUE))
myData2 <- data.frame(id = 1:17,
name = paste("b", 1:17, sep = ""),
marks = sample(30:100, 17, replace = TRUE),
gender = sample(c("F", "M"), 17, replace = TRUE))
Second, different options for "grouping".
Option 1: Return (in a list) the values from myData1 and myData2 which match a given condition. For this example, you'll end up with a list of two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) x[x$marks >= 30 & x$marks <= 50, ])
Option 2: Return (in a list) each dataset split into two, one for FALSE (doesn't match the stated condition) and one for TRUE (does match the stated condition). In other words, creates four groups. For this example, you'll end up with a nested list with two list items, each with two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) split(x, x$marks >= 30 & x$marks <= 50))
Option 3: More flexible than the first. This is essentially #Sacha's example extended to a list. You can set your breaks wherever you would like, making this, in my mind, a really convenient option. For this example, you'll end up with a nested list with two list items, each with multiple data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) split(x, cut(x$marks,
breaks = c(0, 30, 50, 75, 100),
include.lowest = TRUE)))
Option 4: Combine the data first and use the grouping method described in Option 1. For this example, you will end up with a single data.frame containing only values which match the given condition.
# Combine the data. Assumes all the rownames are the same in both sets
myDataALL <- rbind(myData1, myData2)
# Extract just the group of scores you're interested in
myDataALL[myDataALL$marks >= 30 & myDataALL$marks <= 50, ]
Option 5: Using the combined data, split the data into two groups: one group which matches the stated condition, one which doesn't. For this example, you will end up with a list with two data.frames.
split(myDataALL, myDataALL$marks >= 30 & myDataALL$marks <= 50)
I hope one of these options serves your needs!
I had the same kind of issue and after researching some answers on stack overflow I came up with the following solution :
Step 1 : Define range
Step 2 : Find the elements that fall in the range
Step 3 : Plot
A sample code is as shown below:
range = NULL
for(i in seq(0, max(all$downlink), 2000)){
range <- c(range, i)
}
counts <- numeric(length(range)-1);
for(i in 1:length(counts)) {
counts[i] <- length(which(all$downlink>=range[i] & all$downlink<range[i+1]));
}
countmax = max(counts)
a = round(countmax/1000)*1000
barplot(counts, col= rainbow(16), ylim = c(0,a))
I'm trying to replace elements of a data.frame containing "#N/A" with "NULL", and I'm running into problems:
foo <- data.frame("day"= c(1, 3, 5, 7), "od" = c(0.1, "#N/A", 0.4, 0.8))
indices_of_NAs <- which(foo == "#N/A")
replace(foo, indices_of_NAs, "NULL")
Error in [<-.data.frame(*tmp*, list, value = "NULL") :
new columns would leave holes after existing columns
I think that the problem is that my index is treating the data.frame as a vector, but that the replace function is treating it differently somehow, but I'm not sure what the issue is?
NULL really means "nothing", not "missing" so it cannot take the place of an actual value - for missing R uses NA.
You can use the replacement method of is.na to directly update the selected elements, this will work with a logical result. (Using which for indices will only work with is.na, direct use of [ invokes list access, which is the cause of your error).
foo <- data.frame("day"= c(1, 3, 5, 7), "od" = c(0.1, "#N/A", 0.4, 0.8))
NAs <- foo == "#N/A"
## by replace method
is.na(foo)[NAs] <- TRUE
## or directly
foo[NAs] <- NA
But, you are already dealing with strings (actually a factor by default) in your od column by forced coercion when it was created with c(), and you might need to treat columns individually. Any numeric column will never have a match on the string "#N/A", for example.
Why not
x$col[is.na(x$col)]<-value
?
You wont have to change your dataframe
The replace function expects a vector and you're supplying a data.frame.
You should really try to use NA and NULL instead of the character values that you're currently using. Otherwise you won't be able to take advantage of all of R's functionality to handle missing values.
Edit
You could use an apply function, or do something like this:
foo <- data.frame(day= c(1, 3, 5, 7), od = c(0.1, NA, 0.4, 0.8))
idx <- which(is.na(foo), arr.ind=TRUE)
foo[idx[1], idx[2]] <- "NULL"
You cannot assign a real NULL value in this case, because it has length zero. It is important to understand the difference between NA and NULL, so I recommend that you read ?NA and ?NULL.