I am trying to get the minimum of a column.
The data has been split into groups using the "abbr" factor. My objective is to return the value in column 2 for the row that holds the minimum of the column whose number is passed as an argument. If it helps, this is part of the Coursera R Programming introductory course.
The minimum is supposed to be somewhere around 8, but my function reports 10.
Please help me here.
Here's the link to the CSV file, which I loaded with read.csv:
https://drive.google.com/file/d/0Bxkj3-FNtxqrLW14MFZCeEl6UGc/view?usp=sharing
best <- function(abbr, outvar) {
  ## outcome is a dataframe consisting of a column labelled "State" (one of many)
  ## outvar is the desired column number
  statecol <- split(outcome, outcome$State)  ## State is a factor which will be inputted as abbr
  dislist <- statecol[[abbr]][, 2][statecol[[abbr]][, outvar] ==
                                     min(statecol[[abbr]][, outvar])]
  dislist
}
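In my opinion the problem is the missing values: the file encodes them as the string "Not Available", so tell the reader to treat that string as NA (na.strings = "Not Available") and pass na.rm = TRUE to min: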
filedata <- read.table(file.choose(), quote = '"', sep = ",", dec = ".", header = TRUE,
                       stringsAsFactors = FALSE, na.strings = "Not Available")
f <- function(df, abbr, outVar, na.rm = TRUE) {
  outlist <- split(df, df["State"])
  tempCol <- outlist[[abbr]][outVar]
  outlist[[abbr]][, 2][which(tempCol == min(tempCol, na.rm = na.rm))]
}
f(filedata, "AK", 44)
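To illustrate why the reading step matters, here is a minimal sketch with made-up values (the vector raw is hypothetical and not taken from the CSV). It assumes the column was read as text because of the "Not Available" entries, which may be why 10 is reported instead of 8: text is compared character by character, not numerically.

```r
# Hypothetical toy values standing in for one outcome column of the CSV
raw <- c("10.1", "8.3", "Not Available", "12.0")

# Read as text, values are compared character by character,
# so "10.1" sorts before "8.3"
min_as_text <- min(raw)

# Convert to numbers; "Not Available" becomes NA
# (the coercion warning is expected, so it is silenced here)
num <- suppressWarnings(as.numeric(raw))

min_with_na <- min(num)                # NA, because one value is missing
min_clean   <- min(num, na.rm = TRUE)  # 8.3
row_of_min  <- which.min(num)          # 2; which.min() ignores NA on its own
```

Using na.strings when reading the file does the conversion up front, so the column arrives as numeric and only na.rm = TRUE is needed afterwards.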
## my data frame
crime = read.csv("url")
## specific columns that need to be represented
property_crime = crime$Burglary + crime$Theft + crime$`Motor Vehical Theft`
## the rows that I am looking for have the name "harris" within the column named "county_name"
## my attempt
with(crime, hist(harris))
## Error in hist(harris) : object 'harris' not found
Not sure why I am getting object 'harris' not found as that is the name under the county_name column. I'm new to R, could someone walk me through the process of displaying a histogram only including the values of specific columns and specific rows?
the rows that I am looking for have the name "harris" within the column named "county_name"
You have to tell R the same logic that you are telling us.
There are several ways of doing this in R, but I am going to show the base R way.
We can access the desired rows and columns of crime by indexing like data.frame[rows, columns]. So, in your case, crime[harris_rows, some_column] should work, where some_column is a numeric column, because hist() needs numbers and county_name itself is text. To get harris_rows, we can build a logical index like so: crime$county_name == "harris" (note the quotes; a bare harris is looked up as an object, which is why you got the error). If we put all of this together with your property_crime vector and call hist():
hist(property_crime[crime$county_name == "harris"])
You don't provide a reproducible example, but you can check a similar logic with the mtcars dataset. Here, I am making the histogram of the cars with mpg > 15
hist(mtcars[mtcars$mpg >15, "mpg"])
# this is another option that produces the same result
# hist(mtcars$mpg[mtcars$mpg >15])
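The same pattern works when the filter is on a different column than the one being plotted, which is closer to the county_name case. A small sketch using the built-in mtcars data; the variable names four_cyl and mpg_four are only illustrative:

```r
# Logical index: TRUE for rows where the filter column matches
four_cyl <- mtcars$cyl == 4

# Keep only those rows, and only the numeric column we want to plot
mpg_four <- mtcars[four_cyl, "mpg"]

# Histogram of mpg restricted to the 4-cylinder cars
h <- hist(mpg_four, main = "mpg of 4-cylinder cars", xlab = "mpg")
```

The filter column (cyl here, county_name in the question) only decides which rows are kept; the values passed to hist() must still be numeric.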
Suppose I have a dataset of the following form:
City=c(1,2,2,1)
Business=c(2,1,1,2)
ExpectedRevenue=c(35,20,15,19)
zz=data.frame(City,Business,ExpectedRevenue)
zz_new=do.call("rbind", replicate(zz, n=30, simplify = FALSE))
My actual dataset contains about 200K rows. Furthermore, it contains information for over 100 cities.
Suppose, for each city (which I also call "Type"), I have the following functions which need to be applied:
#Writing the custom functions for the categories here
Type1=function(full_data,observation){
  NewSet=full_data[which(!full_data$City==observation$City),]
  BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
  return(BusinessMax)
}
Type2=function(full_data,observation){
  NewSet=full_data[which(!full_data$City==observation$City),]
  BusinessMax = max(NewSet$ExpectedRevenue)-100*rnorm(1)
  return(BusinessMax)
}
Once again, the above two functions are extremely simple ones that I use for illustration. The idea here is that for each City (or "Type") I need to run a different function for each row in my dataset. In the above two functions, I used rnorm in order to check that a different value is drawn for each row.
Now, for the entire dataset, I want to first divide the observations by City (or "Type"). I can do this using (zz_new[["City"]]==1) [also see below]. Then I want to run the respective function for each class. However, when I run the code below, I get -Inf.
Can someone help me understand why this is happening?
For the example data, I would expect to obtain 20 plus 10 times some random value (for Type =1) and 35 minus 100 times some random value (for Type=2). The values should also be different for each row since I am drawing them from a random normal distribution.
library(dplyr) #I use dplyr here
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
Thanks a lot in advance.
Let's take a look at your code.
I rewrote your code
library(dplyr)
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
to
zz_new %>%
  mutate(AdjustedRevenue = case_when(City == 1 ~ Type1(zz_new, zz_new),
                                     City == 2 ~ Type2(zz_new, zz_new)))
since you are using dplyr but don't use the powerful tools provided by this package.
Besides the usage of mutate one key change is that I replaced zz_new[,] with zz_new. Now we see that both arguments of your Type-functions are the same dataframe.
Next step: Take a look at your function
Type1 <- function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
which is called by Type1(zz_new,zz_new). So the definition of NewSet gives us
NewSet=full_data[which(!full_data$City==observation$City),]
# replace the arguments
NewSet <- zz_new[which(!zz_new$City==zz_new$City),]
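Thus NewSet is always a dataframe with zero rows. Applying max to an empty column of a data.frame yields -Inf (with a warning), and because each Type function returns that single value, case_when recycles it to every row.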
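A minimal sketch of the diagnosis, followed by one possible way to get a separate value per row. It uses the four-row example trimmed to the two columns involved. The names other_city_max, noise and adjusted are hypothetical, and the per-row loop is only one option that assumes "the maximum over all other cities" is what each row needs.

```r
City <- c(1, 2, 2, 1)
ExpectedRevenue <- c(35, 20, 15, 19)
zz <- data.frame(City, ExpectedRevenue)

# Comparing the City column with itself is TRUE everywhere,
# so negating it selects no rows at all
empty <- zz[which(!zz$City == zz$City), ]
empty_max <- suppressWarnings(max(empty$ExpectedRevenue))  # -Inf

# One value per row: the maximum revenue among rows from a different city
other_city_max <- vapply(seq_len(nrow(zz)), function(i) {
  max(zz$ExpectedRevenue[zz$City != zz$City[i]])
}, numeric(1))

# Draw one random value per row, then pick the rule by city
noise <- rnorm(nrow(zz))
adjusted <- ifelse(zz$City == 1,
                   other_city_max + 10 * noise,
                   other_city_max - 100 * noise)
```

For 200K rows and 100+ cities it would likely be cheaper to compute the "other cities" maximum once per city and look it up, since the per-row loop repeats the same work for every row of a city.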
I am trying to obtain the number of cases for each variable in a df. There are 275 cases in the df but most columns have some missing data. I am trying to run a for loop to obtain the information as follows:
idef_id<-readxl::read_xlsx("IDEF.xlsx")
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(i))
275-nas
}
however the output for casenums is
> summary(casenums)
Length Class Mode
0 NULL NULL
Any help would be much appreciated!
A for loop isn't a function: it doesn't return anything useful (its value is always NULL), so x <- for(... doesn't ever make sense. You can do that with, e.g., sapply, like this:
casenums <- sapply(idef_id, function(x) sum(!is.na(x)))
Or you can do it in a for loop, but you need to assign to a particular value inside the loop:
casenums = rep(NA, ncol(idef_id))
names(casenums) = names(idef_id)
for(i in names(idef_id)) {
  casenums[i] = sum(!is.na(idef_id[[i]]))
}
You also had a problem that i is taking on column names, so sum(is.na(i)) is asking if the value of the column name is missing. You need to use idef_id[[i]] to access the actual column, not just the column name, as I show above.
You seem to want the answer to be the number of non-NA values, so I switched to sum(!is.na(...)) to count that directly, rather than hard-coding the number of rows of the data frame and doing subtraction.
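As a further sketch, the same count can be obtained without an explicit loop via colSums. The data frame toy below is made up for illustration, since the IDEF.xlsx file is not available here.

```r
# Hypothetical stand-in for idef_id, with some missing values
toy <- data.frame(a = c(1, NA, 3),
                  b = c(NA, NA, "x"),
                  c = 1:3)

# is.na() on a data frame gives a logical matrix;
# summing the negation column by column counts the non-missing cases
casenums <- colSums(!is.na(toy))

# Same result with the sapply approach
casenums_sapply <- sapply(toy, function(x) sum(!is.na(x)))
```

Both versions return a named vector, so casenums["a"] gives the number of complete cases for column a.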
The immediate problem with your for loop is that your i is a column name, not the data within. On each pass through the for loop, i is a character vector of length 1, so sum(is.na(i)) is going to be 0. Due to how frames are structured, there is very little likelihood that a name is NA (though it is possible ... with manual subterfuge).
I suggest a literal fix for your code could be:
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
275-nas
}
But this has the added problem that for loops don't return anything (as Gregor's answer also discusses). For the sake of walking through things, I'll keep that (for the first bullet), and then fix it (in the second):
Two things:
- Hard-coding 275 (assuming that's the number of rows in the frame) will be problematic if/when your data ever changes. Even if you're "confident" it never will ... I still recommend not hard-coding it. If it's based on the number of rows, then perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
OUT_OF - nas
}
at least in a declarative sense, where the variable name (please choose something better) is clear as to how you determined 275 and how (if necessary) it should be fixed in the future.
(Or better, use Gregor's logic of sum(!is.na(...)) if you just need to count not-NA.)
- Doing something for each column of a frame is easily done using sapply or lapply, perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
OUT_OF - sapply(idef_id, function(one_column) sum(is.na(one_column)))
## or
sapply(idef_id, function(one_column) OUT_OF - sum(is.na(one_column)))
I'm working with a dataframe that has really long organization names (more than 25 characters). I'm trying to make a bar graph (with plotly) showing all of these organization names, but the names get cut off because they're so long. I've already tried adjusting the margins, like the following:
plot_ly(x = number, y = org_name, type = 'bar') %>%
layout(margin = list(l = 150))
It works but the bar graph doesn't look nice so the alternative I'm trying to do is abbreviate any organization's name that are longer than 25 characters. However, I'm having a hard time doing so. One way I tried to abbreviate it is to create a new column called abbrv, use substring to get the first 25 characters of the organization name and then do "...", and then put it in the column. While for the organization's name that isn't greater than 25, I would just put an NA in the abbrv column like the following:
for(i in dataframe.name$org_name){
  if(nchar(i) > 25){
    dataframe.name$abbrv <- paste0(substring(i, 0, 25), "...")
  }
  else{
    dataframe.name$abbrv <- "NA"
  }
}
The only catch with this approach is that, once I have the abbrv column (if it works), how do I make sure that plotly displays the abbrv value when the organization name is longer than 25 characters and the normal organization name otherwise?
Anyway, that was one approach I tried, but it doesn't quite work, since the abbrv column ends up with "NA" for ALL of the rows, no matter how long the organization names are. Another approach I tried is to use a replace function such as:
for(i in dataframe.name$org_name){
if(nchar(i) > 25){
dataframe.name[i].replace(
to_replace=i,
value= abbreviate(i)
)
}
But I get errors for that one as well. At this point, I'm really not sure how to abbreviate the long names in my dataframe. If anyone can help me out, that'd be great! Thanks.
*******Edit*******
So now I'm using this code:
for(i in 1:nrow(dfname)){
if(nchar(dfname$orgname[i]) > 25){
dfname$abbrv.column <- substring(dfname$orgname[i], 0, 25)
}
else{
dfname$abbrv.column <- dfname$orgname
}
}
This isn't quite working though because all of the entries are the same organization name
dataframe.name$abbrv is the whole column of abbreviations in the dataframe, not just a single name, so each assignment inside the loop overwrites every row.
That is the reason all entries in dataframe.name$abbrv are being set to "NA": the last name in the dataframe is 25 characters or less, so the final pass assigns "NA" to all entries.
@brettljausn has a decent suggestion: just do away with the NAs completely and only truncate where the character count exceeds 25.
Something like this should work a treat:
dataframe.name$abbrv <- substring( dataframe.name$org_name, 0, 25 )
I would try to use abbreviate first though:
dataframe.name$abbrv <- abbreviate( dataframe.name$org_name )
Base R has abbreviate(). For example, to limit names to 8 characters including the ".":
> abbreviate(names(iris), minlength = 8)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
  "Spl.Lngt"   "Spl.Wdth"   "Ptl.Lngt"   "Ptl.Wdth"    "Species"
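For the "truncate only the long names and add ..." variant described in the question, a vectorised sketch with ifelse avoids the loop entirely. The vector org_name is made-up data, and the 25-character cutoff is the one from the question.

```r
# Hypothetical organization names; the second one is longer than 25 characters
org_name <- c("Short Org",
              "An Extremely Long Organization Name Incorporated")

# Work on the whole vector at once: long names are cut to 25 characters
# and get "..." appended, short names are kept unchanged
abbrv <- ifelse(nchar(org_name) > 25,
                paste0(substr(org_name, 1, 25), "..."),
                org_name)
```

Because every row of abbrv now holds something displayable, the column can be passed to plotly directly in place of the full name, with no NA handling needed.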
(edited to reflect help...I'm not doing great with formatting, but appreciate the feedback)
I'm a bit stuck on what I suspect is an easy enough problem. I have multiple data sets loaded into R, with different numbers of observations, but all of them have three variables named "A1", "A2", and "A3". I want to create a new variable in each of the data frames that contains the value held in "A1" if A3 contains a value greater than zero, and the value held in "A2" if A3 contains a value less than zero. Seems simple enough, right?
My attempt at this code uses this faux-data:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=cbind(A1,A2,A3)
A3=runif(100,-1,1)
df2=cbind(A1,A2,A3)
I'm about a thousand percent sure that R has some functionality for creating the same named variable in multiple data frames, but I have tried doing this with lapply:
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
But the newVar is not available for me once I leave the lapply loop. For example, if I ask for the mean of the new variable:
mean(df1$newVar)
[1] NA
Warning message:
In mean.default(df1$newVar) :
argument is not numeric or logical: returning NA
Any help would be appreciated.
Thank you.
Well first of all, df1 and df2 are not data.frames but matrices (the dollar syntax doesn't work on matrices).
In fact, if you do:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=as.data.frame(cbind(A1,A2,A3))
A3=runif(100,-1,1)
df2=as.data.frame(cbind(A1,A2,A3))
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2
})
the code almost works but gives some warnings. In fact, there's still an error in the last line of the function called by lapply. If you change it like this, it works as expected:
lapply(mylist,function(x){
  x$newVar=x$A1
  x$newVar[x$A3>0]=x$A2[x$A3>0] # you need to subset x$A2 otherwise it's too long
  return(x) # better to state explicitly what's the return value
})
EDIT (as per comment):
As is almost always the case in R, functions do not modify existing objects but return brand new ones.
So, in this case df1 and df2 are unchanged, but lapply returns a list with the expected 2 new data.frames, i.e.:
resultList <- lapply(mylist,function(x){
  x$newVar=x$A1
  x$newVar[x$A3>0]=x$A2[x$A3>0]
  return(x)
})
newDf1 <- resultList[[1]]
newDf2 <- resultList[[2]]
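A compact variant of the same idea, as a sketch: naming the list elements lets the results be picked out by name, and ifelse expresses the rule in one line. The small data frames below are made up, and the rule follows the code above (take A2 where A3 > 0, otherwise A1); swap the last two ifelse arguments if the opposite rule is intended.

```r
# Small hypothetical data frames with the three shared columns
df1 <- data.frame(A1 = 1:5, A2 = -(5:1), A3 = c(0.5, -0.2, 0.9, -0.7, 0.1))
df2 <- data.frame(A1 = 1:3, A2 = -(3:1), A3 = c(-0.4, 0.3, -0.8))

# A named list, so the results keep the names df1 and df2
mylist <- list(df1 = df1, df2 = df2)

result <- lapply(mylist, function(x) {
  # A2 where A3 is positive, A1 otherwise (same rule as the code above)
  x$newVar <- ifelse(x$A3 > 0, x$A2, x$A1)
  x
})

# The originals are untouched; the new column lives in the returned list
mean_new <- mean(result$df1$newVar)
```

Keeping the data frames together in the list, rather than copying them back into separate variables, usually makes later steps easier because the same lapply pattern can be reused.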