How to assign an edited dataset to a new variable in R? - r

The title might be misleading but I have the scenario here:
half_paper <- lapply(data_set[,-1], function(x) x[x==0]<-0.5)
This line is supposed to replace every 0 with 0.5 in all of the columns except the first one.
Then I want to take half_paper and rank each of its columns, again excluding the first one:
prestige_paper <-apply(half_paper[,-1],2,rank)
But I get an error and I think that I need to somehow make half_paper into a data set like data_set.
Thanks for all of your help

Your main issue (replacing 0 with 0.5 in every column except the first) can be fixed by adding another line to your anonymous function. The assignment ("gets") operator <- returns the value of its right-hand side, so each call to your function returned 0.5 and lapply gave back 0.5 for every column. Adding a final line that returns the modified vector fixes this.
Note also that lapply returns a list, not a data frame. I used apply instead of lapply here for consistency with your second line, but plyr::ddply may suit this specific need better. Because the first column is already dropped in the first step, it is not dropped again before ranking.
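# Replace 0 with 0.5 in every column but the first; the function returns the modified column
half_mtcars <- apply(mtcars[, -1], 2, function(x) { x[x == 0] <- 0.5; return(x) })
# Rank the values within each column
prestige_mtcars_tail <- apply(half_mtcars, 2, rank)
# Reattach the untouched first column
prestige_mtcars <- cbind(mtcars[, 1, drop = FALSE], prestige_mtcars_tail)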
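To make the point about the return value of <- concrete, here is a minimal sketch on a small made-up data frame (the names df, bad, fixed and ranked are illustrative and not from the question). It also shows one way to keep the result as a data frame rather than a matrix, which is an alternative to the apply/cbind approach above and not the only option.

```r
# Toy data: the first column is an identifier, the rest are numeric
df <- data.frame(id = c("a", "b", "c"), x = c(0, 2, 5), y = c(3, 0, 0))

# The function's value is the value of the assignment, i.e. 0.5,
# so every list element is just 0.5
bad <- lapply(df[, -1], function(x) x[x == 0] <- 0.5)

# Returning x explicitly gives back the modified column;
# assigning into fixed[-1] keeps the data frame structure
fixed <- df
fixed[-1] <- lapply(df[-1], function(x) { x[x == 0] <- 0.5; x })

# Rank each column except the first
ranked <- fixed
ranked[-1] <- lapply(fixed[-1], rank)

print(bad)
print(ranked)
```

Assigning into fixed[-1] leaves the first column in place, so no cbind step is needed afterwards.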

Related

Should I use 'which' on filters?

When filtering a dataset you can use:
df[df$column==value,]
or
df[which(df$column==value),]
The first filter indexes with a logical vector. The second indexes with a vector of positions (those where the logical vector is TRUE). Is one better than the other? I see that sometimes the first one returns a row with all values as NA...
Which of the two expressions is more correct?
Thanks!
You should (almost) always prefer the first version.
Why? Because it’s simpler. Don’t add unnecessary complexity to your code — programming is hard enough as it is, we do not want to make it even harder; and small complexities add to each other supra-linearly.
One case where you might want to use which is when your input contains NAs that you want to ignore:
df = data.frame(column = c(1, NA, 2, 3))
df[df$column == 1, ]
# 1 NA
df[which(df$column == 1), ]
# 1
However, even in this case I would not use which; instead, I would handle the presence of NAs explicitly to document that the code expects NAs and wants to handle them. The idea is, once again, to make the code as simple and self-explanatory as possible. This implies being explicit about your intent, instead of hiding it behind non-obvious functions.
That is, in the presence of NAs I would use the following instead of which:
df[! is.na(df$column) & df$column == 1, ]
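The all-NA row mentioned in the question is easier to see with more than one column. The following is a small sketch on made-up data (the label column and the variable names are assumptions added for illustration) comparing the three forms:

```r
df <- data.frame(column = c(1, NA, 2, 3),
                 label  = c("a", "b", "c", "d"))

# NA == 1 is NA, and an NA in a logical row index yields a row of NAs
by_logical <- df[df$column == 1, ]

# which() keeps only the positions that are TRUE, silently dropping NAs
by_which <- df[which(df$column == 1), ]

# Explicit NA handling gives the same rows while stating the intent
by_explicit <- df[!is.na(df$column) & df$column == 1, ]

print(by_logical)
print(by_which)
print(by_explicit)
```

Here by_logical has two rows (the match plus an all-NA row), while the other two have one row each.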

Problem deleting elements with 2 values in R list

I am trying to format a list so that it has one word per element (I imported it from a very poor quality csv, and can't do much about improving the csv). I want every element to have only one value; however, the code I am currently using is not doing this, although I am not getting error messages.
Here is the code I am currently using:
Terms <- [] #9020 elements with lengths 1, 2, and 3
for (x in 1:length(Terms)){
  if (Terms[[x]] %>% is.list()){
    term <- Terms[[x]]
    length(term) <- 1
    Terms[[x]] <- term
  }
} # should return list of same size, but only with elements of length 1
Any help figuring out what I could use to make it so that I can delete any second variables would be appreciated.
An option would be to create a logical condition with lengths and then use that to subset the list, keeping only the elements of length 1:
lst2 <- lst1[lengths(lst1) == 1]
If the intention is instead to keep only the first value of each element:
lst2 <- lapply(lst1, `[`, 1)
NOTE: This assumes the list elements are vectors.
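A small sketch of both options on made-up data follows (lst1 and its contents are illustrative). It also suggests one likely reason the loop in the question changes nothing: if the elements are character vectors rather than lists, is.list() is FALSE for each of them and the body never runs. That is an assumption about the asker's data, not something the question confirms.

```r
# A list whose elements are character vectors of length 1, 2 and 3
lst1 <- list("alpha", c("beta", "gamma"), c("delta", "epsilon", "zeta"), "eta")

# Character vectors are not lists, so a check like is.list() never fires
any_lists <- any(vapply(lst1, is.list, logical(1)))

# Option 1: drop every element that has more than one value
only_single <- lst1[lengths(lst1) == 1]

# Option 2: keep all elements, truncated to their first value
first_only <- lapply(lst1, `[`, 1)

print(any_lists)
print(unlist(only_single))
print(unlist(first_only))
```

Option 2 keeps the list the same size, which matches the comment in the question's code.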

Dynamically call dataframe column & conditional replacement in R

First question post. Please excuse any formatting issues that may be present.
What I'm trying to do is conditionally replace a factor level in a dataframe column. The reason is the Unicode difference between a right single quotation mark (U+2019) and an apostrophe (U+0027).
All of the columns that need this replacement begin with "INN8", so I'm using
grep("INN8", colnames(demoDf)) -> apostropheFixIndices
for(i in apostropheFixIndices) {
levels(demoDfFinal[i]) <- c(levels(demoDf[i]), "I definitely wouldn't")
(insert code here)
}
to get the indices in order to perform the conditional replacement.
I've taken a look at a myriad of questions that involve naming variables on the fly: naming variables on the fly
as well as how to assign values to dynamic variables
and have explored the R-FAQ on turning a string into a variable and looked into Ari Friedman's suggestion that named elements in a list are preferred. However I'm unsure as to the execution as well as the significance of the best practice suggestion.
I know I need to do something along the lines of
demoDf$INN8xx[demoDf$INN8xx=="I definitely wouldn’t"] <- "I definitely wouldn't"
but the iterations I've tried so far haven't worked.
Thank you for your time!
If I understand you correctly, then you don't want to rename the columns. Then this might work:
demoDf <- data.frame(A=rep("I definitely wouldn’t",10) , B=rep("I definitely wouldn’t",10))
newDf <- apply(demoDf, 2, function(col) {
gsub(pattern="’", replacement = "'", x = col)
})
This checks every column for the wrong symbol. Note that apply returns a character matrix here, not a data frame.
Or if you have a vector containing the column indices you want to check then you could go with
# Let's say you identified columns 2, 5 and 8
cols <- c(2,5,8)
sapply(cols, function(col) {
demoDf[,col] <<- gsub(pattern="’", replacement = "'", x = demoDf[,col])
})
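Since the question is about factor columns selected by name, here is a hedged sketch that avoids <<- and keeps the columns as factors by rewriting their levels. The column names INN8a, INN8b and other, and the helper variable fix_cols, are made up for illustration; only the "INN8" prefix comes from the question.

```r
# Toy data: two factor columns with the curly quote (U+2019), one unrelated column
demoDf <- data.frame(
  INN8a = factor(c("I definitely wouldn\u2019t", "Maybe")),
  INN8b = factor(c("Maybe", "I definitely wouldn\u2019t")),
  other = c("left\u2019alone", "x"),
  stringsAsFactors = FALSE
)

# Indices of the columns whose names start with "INN8"
fix_cols <- grep("^INN8", colnames(demoDf))

# Replace the curly quote with a plain apostrophe in those columns only.
# For factors, editing the levels changes every matching value at once.
demoDf[fix_cols] <- lapply(demoDf[fix_cols], function(col) {
  if (is.factor(col)) {
    levels(col) <- gsub("\u2019", "'", levels(col), fixed = TRUE)
    col
  } else {
    gsub("\u2019", "'", col, fixed = TRUE)
  }
})

print(levels(demoDf$INN8a))
```

Assigning the result of lapply back into demoDf[fix_cols] modifies the data frame in the calling scope directly, so no global assignment is required.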

R returns list instead of filling in dataframe column

I am trying to use apply() to fill in an additional column in a dataframe by calling a function I created on each row of the data frame.
The dataframe is called Hit.Data and has 2 columns, Zip.Code and Hits. Here are a few rows:
Zip.Code , Hits
97222 , 20
10100 , 35
87700 , 23
The apply code is the following:
Hit.Data$Zone = apply(Hit.Data, 1, function(x) lookupZone("89000", x["Zip.Code"]))
The lookupZone() function is the following:
lookupZone <- function(sourceZip, destZip){
sourceKey = substr(sourceZip, 1, 3)
destKey = substr(destZip, 1, 3)
return(zipToZipZoneMap[[sourceKey]][[destKey]])
}
All the lookupZone() function does is take the two strings, truncate them to the required characters, and look up the value. What happens when I run this code, though, is that R assigns a list to Hit.Data$Zone instead of filling in data row by row.
> typeof(Hit.Data$Zone)
[1] "list"
What baffles me is that when I use apply and just tell it to put a number in it works correctly:
> Hit.Data$Zone = apply(Hit.Data, 1, function(x) 2)
> typeof(Hit.Data$Zone)
[1] "double"
I know R has a lot of strange behavior around dropping dimensions of matrices and doing odd things with lists but this looks like it should be pretty straightforward. What am I missing? I feel like there is something fundamental about R I am fighting, and so far it is winning.
Your problem is that you are occasionally looking up non-existing entries in your hashmap, which causes hash to silently return NULL. Consider:
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["101"]]
[1] 3
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["100"]]
NULL
If apply encounters any NULL values, then it can't coerce the result to a vector, so it will return a list. Same will happen with sapply.
You have to ensure that all possible combinations of the first three zip code digits in your data are present in your hash, or you need logic in your code to return NA instead of NULL for missing entries.
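The effect can be reproduced without the hash package. The sketch below uses a plain nested list as a stand-in for zipToZipZoneMap (an assumption, since the question does not show how the map is built), and lookup_zone_safe is a hypothetical helper that returns NA instead of NULL for missing keys.

```r
# Stand-in for the zone map: a nested named list (assumed structure)
zone_map <- list("890" = list("972" = 3, "101" = 3, "877" = 3))

# The last zip code has no entry for its first three digits
zips <- c("97222", "10100", "87700", "10000")

# A missing key yields NULL, so sapply cannot simplify and returns a list
unsafe <- sapply(zips, function(z) zone_map[["890"]][[substr(z, 1, 3)]])

# Hypothetical helper: map NULL to NA so every result has length 1
lookup_zone_safe <- function(source_zip, dest_zip) {
  zone <- zone_map[[substr(source_zip, 1, 3)]][[substr(dest_zip, 1, 3)]]
  if (is.null(zone)) NA_real_ else zone
}

# vapply enforces one numeric value per input, so the result is a plain vector
safe <- vapply(zips, function(z) lookup_zone_safe("89000", z), numeric(1))

print(typeof(unsafe))
print(safe)
```

Using vapply rather than sapply has the side benefit of failing loudly if a lookup ever returns something other than a single number.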
As others have said, it's hard to diagnose without knowing what zipToZipZoneMap contains, but you could try this:
Hit.Data$Zone <- sapply(Hit.Data$Zip.Code, function(x) lookupZone("89000", x))

How to access single elements in a table in R

How do I grab elements from a table in R?
My data looks like this:
V1 V2
1 12.448 13.919
2 22.242 4.606
3 24.509 0.176
etc...
I basically just want to grab elements individually. I'm getting confused with all the R terminology, like vectors, and I just want to be able to get at the individual elements.
Is there a function where I can just do like data[v1][1] and get the element in row 1 column 1?
Try
data[1, "V1"] # Row first, quoted column name second, and case does matter
Further note: Terminology in discussing R can be crucial and sometimes tricky. Using the term "table" to refer to that structure leaves open the possibility that it was either a 'table'-classed, or a 'matrix'-classed, or a 'data.frame'-classed object. The answer above would succeed with any of them, while @BenBolker's suggestion below would only succeed with a 'data.frame'-classed object.
There is a ton of free introductory material for beginners in R: CRAN: Contributed Documentation
?"[" pretty much covers the various ways of accessing elements of things.
Under usage it lists these:
x[i]
x[i, j, ... , drop = TRUE]
x[[i, exact = TRUE]]
x[[i, j, ..., exact = TRUE]]
x$name
getElement(object, name)
x[i] <- value
x[i, j, ...] <- value
x[[i]] <- value
x$i <- value
The second item is sufficient for your purpose
Under Arguments it points out that with [ the arguments i and j can be numeric, character or logical
So these work:
data[1,1]
data[1,"V1"]
As does this:
data$V1[1]
and keeping in mind a data frame is a list of vectors:
data[[1]][1]
data[["V1"]][1]
will also both work.
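As a quick check, the forms listed above can be run side by side. This sketch rebuilds the three rows shown in the question as a data frame (that the asker's object is a data frame is an assumption) and collects the results in a vector called results, a name chosen only for illustration.

```r
data <- data.frame(V1 = c(12.448, 22.242, 24.509),
                   V2 = c(13.919, 4.606, 0.176))

# Five ways to reach the element in row 1, column 1
results <- c(
  by_position   = data[1, 1],
  by_name       = data[1, "V1"],
  dollar        = data$V1[1],
  list_position = data[[1]][1],
  list_name     = data[["V1"]][1]
)
print(results)

# A single index in single brackets keeps the data frame structure,
# while double brackets extract the underlying vector
print(class(data[1]))
print(class(data[[1]]))

# A negative index excludes a column rather than selecting it
without_v2 <- data[1, -2]
print(without_v2)
```

The difference between data[1] and data[[1]] is usually the first point of confusion, since only the latter gives a vector that can be indexed again with [1].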
So that's a few things to be going on with. I suggest you type in the examples at the bottom of the help page one line at a time (yes, actually type each line in and see what it does). You'll pick things up very quickly, and typing rather than copy-pasting is an important part of committing it to memory.
Maybe not as polished as the answers above, but I think this is what you were looking for.
data[1:1,3:3] #works with positive integers
data[1:1, -3:-3] #does not work, gives the entire 1st row without the 3rd element
data[i:i,j:j] #given that i and j are positive integers
Indexing starts at 1, i.e.,
data[1:1,1:1] #means the top-leftmost element
