Complex self-referencing of a dataframe

I could not find anything that answered this question, so I apologize if it is a duplicate. I'm also not sure exactly how to phrase it.
Here is the example dataframe I created for Stack Overflow; my real dataset is much more complex.
The idea is that this is a dataset of workers. Each worker has info in columns named name, age, State (where they are located), and State_Lead, a Boolean column that indicates whether they are the worker in charge of that State.
My goal here is twofold. I want code that:
1) references the State and State_Lead columns and requires exactly 1 (not zero, not more than 1) State_Lead = TRUE per State. If a State has more or fewer than 1, I want to randomize who in that State becomes the State Lead.
2) calls up the current State_Lead = TRUE row for each State. Ideally I could reference a State and be able to call anything from the row of its State Lead (the rows are named the same as the Name column).
#I made Jack not the state lead so the goal should be to return James and Jill
Database["Jack", "State_Lead"]=FALSE
All_States <- unique(Database$State)
All_States
##Here I thought I could cycle through each state and return the rows that matched each State Leader
heads <- NULL
for(i in All_States){
heads <- append( heads, Database[, "State"==i])
}
heads
## heads just returns "list()"
###attempt 2
heads <- NULL
for(i in All_States){
if (sum(Database[Database[,"State"==i], "State_Lead"]) = 1)
heads <-append(heads, Database[,"State"==i], "State_Lead"])
else Database$State==i <- NA
all_in_state <- subset(Database[, State="i"])
sample(all_in_state, 1)
}

All right, so it looks like you're definitely brand new to programming as a whole, and not just R. So first and foremost, I'd highly recommend checking out some of the MOOCs on Coursera, such as this one. But, as for your question, let's look at each piece of it that seems to be causing confusion.
First, when asking for help on this site, it's always best to provide actual data, and not a picture of your dataset. Given that you already had a dataframe in R that you were working with, you could easily take advantage of the dput function and then copy its output into your question. So, for example, you might have the following dataframe:
df = data.frame(name=c("John", "Jim", "Sally"), state=c("MI", "FL", "NY"), state_leader=c(TRUE, FALSE, TRUE))
df
name state state_leader
1 John MI TRUE
2 Jim FL FALSE
3 Sally NY TRUE
Then we can just use dput(df) and get the following output:
dput(df)
structure(list(name = structure(c(2L, 1L, 3L), .Label = c("Jim",
"John", "Sally"), class = "factor"), state = structure(c(2L,
1L, 3L), .Label = c("FL", "MI", "NY"), class = "factor"), state_leader = c(TRUE,
FALSE, TRUE)), .Names = c("name", "state", "state_leader"), row.names = c(NA,
-3L), class = "data.frame")
Those of us on Stack Overflow can now copy the output from dput and have a working copy of your dataset.
Next, let's look at your confusion around how to set new values in a dataset. In your updated text, you tried to set state_leader equal to FALSE with the following code: df["John", "state_leader"] = FALSE. This does not work on the dataframe above, because a character string in the row position is matched against the row names, and the row names of df are 1, 2, and 3. No existing row is called "John", so rather than updating John's row, R appends a new row named "John". (That form only works if the row names really are set to the values of the name column.) The more general way to do what you wanted is to select the row with a condition on the name column:
df[df$name == "John", "state_leader"] = FALSE
This way, R knows that you want the rows where the variable name is equal to "John".
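As a rough illustration of the difference between indexing by condition and indexing by row name, here is a minimal sketch. It rebuilds the small example dataframe from above; the names by_condition and by_rowname are made up for this example, and stringsAsFactors = FALSE is set only so the sketch behaves the same across R versions.

```r
# Example data (same values as df above)
df <- data.frame(name = c("John", "Jim", "Sally"),
                 state = c("MI", "FL", "NY"),
                 state_leader = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

# 1) Select the row with a logical condition on the name column
by_condition <- df
by_condition[by_condition$name == "John", "state_leader"] <- FALSE

# 2) Select the row by row name; this needs the row names to be set first
by_rowname <- df
rownames(by_rowname) <- by_rowname$name
by_rowname["John", "state_leader"] <- FALSE

by_condition$state_leader
by_rowname$state_leader
```

Both copies should end up with John's state_leader set to FALSE and the other rows unchanged.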
So now that we have that, it'd probably be a good time to look at the [ operator and understand how it works, because your algorithm for finding your values is not nearly as complex as you think once you understand how indexing works.
If you have a one-dimensional object in R, such as a vector, [ takes one parameter. If you have a two-dimensional object, such as a dataframe or matrix, [ takes two parameters, either one of which is optional. Let's look at a few examples.
x = 1:10 # A one-dimensional vector
x[1:3] # Get the first three elements of x
x[c(1, 3, 5, 7, 9)] # Get all odd elements of x
x[x %% 2 != 0] # Get all odd elements of x
In the examples above, we're working with a one-dimensional vector. The three operations we perform highlight a few key points about [. First, [ expects a numeric input, or something that can be converted to a numeric input. Second, the numeric inputs do not have to be consecutive. Lastly, the input can be an expression that evaluates to a logical vector, such as x %% 2 != 0. This last example is perfect for demonstrating what I mean by "something that can be converted to a numeric input". You can think of it in the following way: first, R computes x %% 2. It then checks each element to see whether it is equal to 0, which returns a vector of Boolean values equal to TRUE or FALSE. It then checks which values are TRUE and returns a vector of indices equal to c(1, 3, 5, 7, 9), which is identical to our second example.
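To make that mental model concrete, here is a small sketch that spells out the intermediate steps using which(). The variable names is_odd and odd_positions are just for illustration.

```r
x <- 1:10

# Step 1: the comparison produces a logical vector, one value per element
is_odd <- x %% 2 != 0

# Step 2: which() turns the TRUE positions into numeric indices
odd_positions <- which(is_odd)
odd_positions

# Indexing with the logical vector or with the positions selects the same elements
identical(x[is_odd], x[odd_positions])
```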
Now, let's look at df to see how [ works on two-dimensional objects. When working with 2D objects, the first parameter to [ tells it which rows you want, and the second parameter tells it which columns you want.
df[df$name == "John", ] # Get all rows where name equals "John" and ALL columns
df[, c(1, 3)] # Get all rows and only the first and third column
df[grepl("^J", df$name), 3] # Get all rows with names that start with "J" and only the third column
As we see above in the first two examples, you do not need to provide a value for each parameter in [. If you leave one of the values blank, the default is to return all available rows or columns from the object. You'll also notice that we specifically call the column name even when we're specifying rows, such as df[df$name == "John", ]. This is because we need R to understand which column we want to check to determine whether we keep the row. Lastly, you should also notice that all of our prior understanding of [ in one-dimensional objects holds here. It expects a numeric input, or one that can be converted to a numeric input. So, in the first example, df$name == "John" results in a Boolean vector with values c(TRUE, FALSE, FALSE), and R then checks which values are TRUE and returns a value of 1, indicating that only the first row matches that criterion.
So now that we understand how [ works, let's see how to use it to solve our question here. We know that we want all of the columns, so we can ignore the second parameter in [. And we know that we want only the rows where state_leader is TRUE. So let's use that condition in our index.
df[df$state_leader == TRUE, ]
name state state_leader
1 John MI TRUE
3 Sally NY TRUE
As an exercise to you, how would you make this output better by only returning the name and state variables?
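The first half of the original question (exactly one lead per State, chosen at random otherwise) can be handled with the same indexing ideas. The following is only a sketch under assumptions: the workers dataframe and the function name ensure_one_lead are hypothetical, since the original data was posted as a picture, and the column names State and State_Lead are taken from the question.

```r
# Hypothetical example data; MI has no lead and FL has two
workers <- data.frame(
  Name = c("Jack", "Jill", "James", "Jane", "Joe"),
  State = c("MI", "MI", "FL", "FL", "NY"),
  State_Lead = c(FALSE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# For every State that does not have exactly one lead, pick one worker at random
ensure_one_lead <- function(workers) {
  for (st in unique(workers$State)) {
    in_state <- which(workers$State == st)
    if (sum(workers$State_Lead[in_state]) != 1) {
      workers$State_Lead[in_state] <- FALSE
      # Index through sample.int so a State with a single worker is handled correctly
      chosen <- in_state[sample.int(length(in_state), 1)]
      workers$State_Lead[chosen] <- TRUE
    }
  }
  workers
}

set.seed(1)  # only to make the random choice repeatable
fixed <- ensure_one_lead(workers)

# One row per State, so the State can be used as the row name for lookups
leads <- fixed[fixed$State_Lead, ]
rownames(leads) <- leads$State
leads["NY", "Name"]
```

Once leads is keyed by State, anything about a State's lead can be looked up with leads["NY", ] or a single column as shown.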

Related

I am trying to run a nested for loop in r which subtracts each row of a variable in a data frame

Dataframe:
Number Time
1 10:25:00
2 10:35:15
3 10:42:26
For each number in the data frame I want to subtract Time, for example:
Number 1 = 10:25:00 - 10:35:15
Number 2 = 10:35:15 - 10:42:26
My code:
for (i in df$Number) {
for (j in df$Time) {
subtime <- df$Time[j] - df$Time[j+1]
}
}
This code only results in NA

Because subtime is reassigned on every iteration, only the last value is left when the loops finish. Further, on the last iteration j == length(df$Time), so j + 1 is out of bounds and df$Time[j + 1] is NA, which means the entire result is NA.
Instead in general you can do:
df$subtime <- c(NA, diff(df$Time))
where the leading NA stands in for the first row, which has no previous value, and can be replaced by a suitable default. Your case may require additional treatment depending on the exact class of df$Time.
(You should consider creating an MWE of your data if you need further help. What you provided is pretty close, but not quite enough for us to be of help.)
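As one possible treatment when Time is stored as "HH:MM:SS" text, the sketch below converts each time to seconds by hand and then uses diff(). The helper to_seconds is a made-up name, and the data frame is rebuilt from the three rows in the question. Because the question subtracts the next row from the current one, the sign of diff() is flipped and the NA goes at the end.

```r
df <- data.frame(Number = 1:3,
                 Time = c("10:25:00", "10:35:15", "10:42:26"),
                 stringsAsFactors = FALSE)

# Convert "HH:MM:SS" strings to seconds since midnight
to_seconds <- function(hms) {
  parts <- strsplit(hms, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

secs <- to_seconds(df$Time)

# Time[i] - Time[i + 1] in seconds; the last row has no successor, so it gets NA
df$subtime <- c(-diff(secs), NA)
df
```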

I think you may be looking for something like this. The result is given in hours.
For example: 10:25:00 - 10:35:15 = - 00:10:15 = - (10/60) - (15/3600) = -0.1708333
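a = data.frame(Number = c(1, 2, 3), Time = c("10:25:00", "10:35:15", "10:42:26"), stringsAsFactors = FALSE)
# Difference in hours between the time in row x and the time in row x + 1
timeDiff = function(x, a){
  as.numeric(as.difftime(a[x, 2], units = "hours") - as.difftime(a[x+1, 2], units = "hours"))
}
# The last row has no successor, so stop at nrow(a) - 1
result = sapply(1:(nrow(a) - 1), timeDiff, a)
result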
Please note that it's impossible to compute such a difference for Number 3, since a fourth row would be necessary, and the data frame you provided has only 3 rows.
As per Stack Overflow's prompt, I can see you are a new user. For future nested for-loops, I recommend you explore sapply or lapply, as they will make your code cleaner and easier to maintain.
If you need any further clarification, don't hesitate to comment on my answer. :-)

running apply (or variant) like an embedded loop

I'd like to do something like an embedded loop, but using apply functions, the goal of which is to check various conditions prior to moving on to the next part of my program.
I have two objects, a list of product descriptions, which can be created as follows:
test_products <- list(c("dingdong","small","affordable","polished"),c("wingding","medium","cheap","dull"),c("doodad","big","expensive","shiny"))
And a data frame of combinations of features that are not allowed, where each row represents a disallowed combination of features. A sample data frame can be created as follows:
disallowed <- data.frame(trait1 = c("dingdong","wingding","doodad"),
trait2 = c("medium","big","big"),
stringsAsFactors = FALSE)
My goal is to check each product against each of the disallowed combinations as efficiently as possible. So far I can check one product against all prohibitions as follows (in this case, the third product):
apply(disallowed, 1, function(x) x %in% unlist(test_products[[3]]))
OR I can check all products against one of the disallowed combinations of traits (the third combination).
lapply(test_products, function(x) disallowed[3,] %in% x)
Is it possible to check all products against all rows of the data frame of disallowed feature combination, without using a loop?
My end result should look something like this:
Product 1: OK
Product 2: OK
Product 3: NOT OK
Since Product 3 runs afoul of the third disallowed row.

There are definitely more elegant ways, but I am going to share my thoughts on this.
First, the way you created the disallowed data frame is convoluted. I decided to use the following code to create disallowed.
# Create a data frame showing disallowed traits
disallowed <- data.frame(trait1 = c("dingdong","wingding","doodad"),
trait2 = c("medium","big","big"),
stringsAsFactors = FALSE)
I then created a function called violate, which has two arguments. The first argument, product, is a character vector. The second argument, check_df, is the data frame containing the disallowed traits.
The output of violate is a logical vector with one element per row of check_df. TRUE means that both traits in that row are present in the product.
# Create the violate function
violate <- function(product, check_df){
temp_df <- as.data.frame(lapply(check_df, function(Col) Col %in% product))
temp_vec <- apply(temp_df, 1, function(Row) sum(Row) == 2)
return(temp_vec)
}
# Test the violate function
violate(test_products[[3]], check_df = disallowed)
# [1] FALSE FALSE TRUE
After that, I applied the violate function to each element of the test_products list using sapply. The results from violate were evaluated to see if all disallowed checks are FALSE.
# Apply the violate function and check if all results from violate are FALSE
sapply(test_products, function(product){
sum(violate(product, check_df = disallowed)) == 0})
# [1] TRUE TRUE FALSE
As you can see, the third element of the results is FALSE, indicating that the third product is not OK, while product 1 and product 2 are OK because the final results from sapply are both TRUE.
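For comparison, here is a more compact sketch of the same check that also produces the "Product n: OK" style of output asked for in the question. The function name violates_any and the variables not_ok and labels are made up for this example; the two input objects are copied from the question.

```r
test_products <- list(c("dingdong", "small", "affordable", "polished"),
                      c("wingding", "medium", "cheap", "dull"),
                      c("doodad", "big", "expensive", "shiny"))
disallowed <- data.frame(trait1 = c("dingdong", "wingding", "doodad"),
                         trait2 = c("medium", "big", "big"),
                         stringsAsFactors = FALSE)

# TRUE if any row of check_df has both of its traits present in the product
violates_any <- function(product, check_df) {
  any(check_df$trait1 %in% product & check_df$trait2 %in% product)
}

not_ok <- vapply(test_products, violates_any, logical(1), check_df = disallowed)
labels <- paste0("Product ", seq_along(not_ok), ": ", ifelse(not_ok, "NOT OK", "OK"))
cat(labels, sep = "\n")
```

This version assumes the disallowed data frame always has exactly the two columns trait1 and trait2; the violate function above is more general in that it works over whatever columns check_df has.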

min() does not work as expected

I am trying to get the minimum of a column.
The data has been split into groups using the "abbr" factor. My objective is to return the data in column 2 corresponding to the minimum in the column whose number is passed in the argument. If it helps, this is part of the Coursera R programming introductory course.
The minimum is supposed to be somewhere around 8, but it shows 10.
Please help me here.
Here's the link to the csv file on which I used read.csv:
https://drive.google.com/file/d/0Bxkj3-FNtxqrLW14MFZCeEl6UGc/view?usp=sharing
best <- function(abbr, outvar){
## outcome is a dataframe consisting of a column labelled "State" (one of many)
## outvar is the desired column number
statecol <- split(outcome, outcome$State) ##state is a factor which will be inputted as abbr
dislist <- statecol[[abbr]][,2][statecol[[abbr]][, outvar] ==
min(statecol[[abbr]][, outvar])] ##continuation of prev line
dislist
}

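I think the problem is how the NA values are handled. Make sure to tell R that missing values are coded as "Not Available" (with na.strings) and to use na.rm=TRUE in min. Otherwise the column may be read as text rather than numbers, in which case min compares strings, and "10" sorts before "8".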
filedata<-read.table(file.choose(),quote='"',sep=",",dec=".",header=TRUE,stringsAsFactors=FALSE, na.strings="Not Available")
f<-function(df,abbr,outVar,na.rm=TRUE){
outlist<-split(df,df["State"])
tempCol<-outlist[[abbr]][outVar]
outlist[[abbr]][,2][which(tempCol==min(tempCol,na.rm=na.rm))]
}
f(filedata,"AK",44)
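To see the effect in isolation, here is a small sketch with made-up data (the column names Hospital, State and Rate and the values are hypothetical, not taken from the linked file). It reads the same text twice, once without and once with na.strings.

```r
csv_text <- "Hospital,State,Rate\nA,AK,10.1\nB,AK,8.3\nC,AK,Not Available\nD,AK,12.0"

# Without na.strings, "Not Available" forces the whole column to be text
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_text$Rate)
min(as_text$Rate)                  # string comparison, not numeric

# With na.strings, the column is numeric and the missing value becomes NA
as_num <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                   na.strings = "Not Available")
class(as_num$Rate)
lowest <- min(as_num$Rate, na.rm = TRUE)
best_hospital <- as_num$Hospital[which(as_num$Rate == lowest)]
best_hospital
```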

R: "missing value where TRUE/FALSE needed" but works with another similar dataset?

I have the following function "cOrder"
library(MASS)
cOrder=function(anm,sir,dam){
maxloop=1000
i = 1
count = 0
mam=length(anm)
old = rep(1,mam)
new = old
while(i>0){
for (j in 1:mam){
ks = sir[j]
kd = dam[j]
gen = new[j]+1
if(ks != "NA"){
js = match(ks,anm)
if(gen > new[js]){new[js] = gen} #where error occurs
}
if(kd != "NA"){
jd = match(kd,anm)
if(gen > new[jd]){new[jd] = gen}
}
} # for loop
changes = sum(new - old)
old = new
i = changes
count = count + 1
if(count > maxloop){i=0}
} # while loop
return(new)
} # function loop
which works brilliantly when inputting the following dataset:
animal=c("bf","dd","ga","ec","fb","ag","he")
sire=c("dd","ga","NA","ga","NA","bf","dd")
dams=c("he","ec","NA","fb","NA","ec","fb")
gg=cOrder(animal,sire,dams)
but crashes and burns with the following:
animal=c("67947887","67947986","67948372","67948877","67948927","67949057","67950873","67951186","67951285","67951384","67951400","67951525","67951681","68045244","68045657","69999837","77542587","77542629","78468170","79879946")
sire=c("45334307","45334307","40684433","38121933","38141933","40684433","43339787","38431722","40684433","43339787","34931873","40684433","34931873","67951525","67951525","67950873","67951400","67951384","NA","67951681")
dams=c("37084407","25565110","36817369","21897145","21897145","20138814","32629901","37485356","25731548","32129629","31795768","37588084","36812355","68040013","68040500","68040443","67951855","67950980","67949065","67948307")
gg=cOrder(animal,sire,dams)
>Error in if (gen > new[js]) { : missing value where TRUE/FALSE needed
Both of these are inputted as character vectors, so I don't think it is a matter of whether the one set have characters and the other numeric digits. Or could it? Have also tried to make them numeric, import from a .csv, unlist them, etc. Error code stays the same.
My individual names generally consist of 8-digit numeric codes, any suggestions towards preventing this error, or renaming my whole population?
Thanks!
EDIT
The way the datasets are set up is as follows: the first animal in the vector is the offspring of the first dam and sire in their respective vectors. Thus, according to the simple set, bf is the offspring of dd and he, dd of ga and ec, and the parents of ga are unknown.
The idea behind this function is to determine the "oldest" animal/s in the dataset, i.e., the ones with the least number of generations, and eventually in succeeding code order them accordingly and generate a relationship matrix. So it is supposed to be OK if an animal does not appear in the sire list; it means that it is an older animal. So the code is supposed to move on to the next. Which it does in the simple set, but not in the proper one. Any ideas?
Thanks!

It is because your first sire value (45334307) doesn't match anything in your animal list, so match() returns an NA.
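One way to make the function skip parents that are not in the animal list is to test the result of match() for NA before comparing. The snippet below is a minimal sketch of that guard on a tiny made-up example; it is not the full cOrder function, and the variable unguarded exists only to capture the error message.

```r
animal <- c("bf", "dd", "ga")
new <- c(1, 1, 1)
gen <- 2

js <- match("45334307", animal)   # this sire is not in animal, so js is NA

# Unguarded comparison: new[NA] is NA, and if (NA) raises the error
unguarded <- tryCatch(
  if (gen > new[js]) "updated" else "unchanged",
  error = function(e) conditionMessage(e)
)
unguarded

# Guarded update: skip parents that are not listed as animals
if (!is.na(js) && gen > new[js]) new[js] <- gen

# A parent that is present is still updated as before
jd <- match("ga", animal)
if (!is.na(jd) && gen > new[jd]) new[jd] <- gen
new
```

In cOrder, the same !is.na() test could be added to both the sire and the dam branch. The simple dataset works without it because every sire and dam there is either "NA" or present in animal.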

R returns list instead of filling in dataframe column

I am trying to use apply() to fill in an additional column in a dataframe and by calling a function I created with each row of the data frame.
The dataframe is called Hit.Data and has 2 columns, Zip.Code and Hits. Here are a few rows:
Zip.Code , Hits
97222 , 20
10100 , 35
87700 , 23
The apply code is the following:
Hit.Data$Zone = apply(Hit.Data, 1, function(x) lookupZone("89000", x["Zip.Code"]))
The lookupZone() function is the following:
lookupZone <- function(sourceZip, destZip){
sourceKey = substr(sourceZip, 1, 3)
destKey = substr(destZip, 1, 3)
return(zipToZipZoneMap[[sourceKey]][[destKey]])
}
All the lookupZone() function does is take the 2 strings, truncate them to the required characters, and look up the value. What happens when I run this code, though, is that R assigns a list to Hit.Data$Zone instead of filling in data row by row.
> typeof(Hit.Data$Zone)
[1] "list"
What baffles me is that when I use apply and just tell it to put a number in it works correctly:
> Hit.Data$Zone = apply(Hit.Data, 1, function(x) 2)
> typeof(Hit.Data$Zone)
[1] "double"
I know R has a lot of strange behavior around dropping dimensions of matrices and doing odd things with lists but this looks like it should be pretty straightforward. What am I missing? I feel like there is something fundamental about R I am fighting, and so far it is winning.

Your problem is that you are occasionally looking up non-existing entries in your hashmap, which causes hash to silently return NULL. Consider:
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["101"]]
[1] 3
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["100"]]
NULL
If apply encounters any NULL values, then it can't coerce the result to a vector, so it will return a list. Same will happen with sapply.
You have to ensure that all possible combinations of the first three zip code digits in your data are present in your hash, or you need logic in your code to return NA instead of NULL for missing entries.
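Here is a minimal sketch of the second option, returning NA instead of NULL for missing entries. It is written under assumptions: a plain nested list (zone_map) stands in for the hash-based zipToZipZoneMap, the function lookup_zone and the hits dataframe are made up for the example, and the fourth zip code is added only to trigger a missing entry.

```r
# Stand-in for the real lookup table: source prefix -> destination prefix -> zone
zone_map <- list("890" = list("972" = 3, "101" = 3, "877" = 3))

lookup_zone <- function(dest_zip, source_zip, zone_map) {
  source_key <- substr(source_zip, 1, 3)
  dest_key <- substr(dest_zip, 1, 3)
  zone <- zone_map[[source_key]][[dest_key]]
  # A missing entry comes back as NULL; turn it into NA so the result stays a vector
  if (is.null(zone)) NA_real_ else zone
}

hits <- data.frame(Zip.Code = c("97222", "10100", "87700", "10000"),
                   Hits = c(20, 35, 23, 5),
                   stringsAsFactors = FALSE)

# vapply insists on one numeric value per row, so a list can never sneak in
hits$Zone <- vapply(hits$Zip.Code, lookup_zone, numeric(1),
                    source_zip = "89000", zone_map = zone_map,
                    USE.NAMES = FALSE)
typeof(hits$Zone)
hits
```

Using vapply rather than sapply or apply is a design choice here: it fails loudly if any lookup returns something other than a single number, instead of silently falling back to a list.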

As others have said, it's hard to diagnose without knowing what zipToZipZoneMap contains, but you could try this:
Hit.Data$Zone <- sapply(Hit.Data$Zip.Code, function(x) lookupZone("89000", x))
