After subsetting data frame, functions do not work on subset - r

My initial dataset was a csv containing the number of bikes rented in a certain city, along with other variables such as temperature, season, etc.
I wanted a subset in which saison is 3 or 4 and annee is 1. I tried the following:
P<- subset(velo,saison>2&annee==1)
I also tried
W<- velo[which(velo$annee==1 & velo$saison>2),]
Both returned the same data frame of 183 observations and 5 variables.
I then wanted to summarise the data with
summary(W$velos[saison==3])
summary(W$velos[saison==4])
It gives me the following outputs
In the data set I can see that the saison column is not full of NaN, and class() returns integer for that column.

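The issue is that saison inside the brackets is not looked up in W. R searches for a standalone object called saison instead, so the column has to be referenced through the data frame:
summary(W$velos[W$saison==3])
summary(W$velos[W$saison==4])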
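A minimal sketch of the fix, assuming a small made-up stand-in for the velo data (the values below are hypothetical, only the column names saison, annee and velos come from the question). It also shows with() as one alternative way to let the column names resolve inside W.

```r
# Hypothetical stand-in for the velo data set (values are made up)
velo <- data.frame(
  saison = c(1, 2, 3, 3, 4, 4, 3, 4),
  annee  = c(1, 1, 1, 1, 1, 1, 0, 0),
  velos  = c(120, 340, 560, 610, 430, 390, 500, 410)
)

# Same subset as in the question
W <- velo[which(velo$annee == 1 & velo$saison > 2), ]

# Reference the column through the data frame ...
s3 <- W$velos[W$saison == 3]

# ... or let with() look the column names up inside W
s4 <- with(W, velos[saison == 4])

summary(s3)
summary(s4)
```

Either form keeps the index vector the same length as W$velos, which is what the original bare saison did not guarantee.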

Related

How to create a dummy variable based on row numbers in R

I have a 10 (question items) by 500 (respondents) vector in R.
Upper 250 are male while lower 250 are female.
Can you tell me how to create a gender variable, and assign 0 and 1 to this variable based on row numbers in R?
Thank you very much! Stay safe.
This solution assumes your dataset is in a data frame, not a vector, that the dataset is named "dat" (change it to whatever you are calling your data), and that the variable "gender" does not already exist in "dat".
dat$gender <- NA # Creates a new, empty column in the dataset (NA stands for missing data, or not available)
dat[1:250, "gender"] <- "0" # assigns the category 0 to rows 1-250
dat[251:500, "gender"] <- "1" # assigns the category 1 to rows 251-500
Hope this helps! As the comments suggest, providing a sample of your data will help us help you.
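As a possible shorter variant of the same idea, rep() can build the whole column in one step. This is only a sketch: the data frame dat and its item1 column are made up here, and coding 0 as male and 1 as female is an assumption based on the row order described in the question.

```r
# Hypothetical data frame: 500 respondents, one item column shown for brevity
dat <- data.frame(item1 = rep(1:5, times = 100))

# each = 250 repeats 0 for rows 1-250, then 1 for rows 251-500
dat$gender <- rep(c(0, 1), each = 250)

# Optional labelled factor; the male/female coding is an assumption
dat$gender_f <- factor(dat$gender, levels = c(0, 1),
                       labels = c("male", "female"))

table(dat$gender)
```

Unlike the quoted "0" and "1" above, this stores gender as a number, which is usually more convenient for a dummy variable in a model.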

wrong result in comparison

R beginner here. I have a data.frame that contains information on trotting horses (their wins, earnings, time records and such). I have a subsetted data.frame organised in a way that every row contains information for every specific year the horse competed. I have a variable called Competition.age that states what age the horses were each year they competed.
I'm writing down my summary statistics stratified by age and sex of the horse using both the summary() function and describe() from package psych. For example:
summary(Data_year[Data_year$Competition.age>="3"&
Data_year$Competition.age<="6"& Data_year$Sex=="Mare", ])
This works perfectly fine. But when I try to get a range between 7 and 10 years (instead of 3 and 6), it only returns NAs. The str() function with this line of code returns a blank list of variables; for some reason it won't read the data.
I've even created separate subsetted data.frames with only these years (7, 8, 9 and 10 respectively) and there are no problems with those, individually. I created subsetted data frames with ranges 7-8, 7-9 and they were fine! But 7-10 created an empty data.frame.
Any help will be greatly appreciated!!
In your comment you wrote that Data_year$Competition.age is an integer. The key point is that "7" is a character string, not a number. If you compare a numeric value with a non-numeric value (e.g. character), the numeric value is coerced to character and the comparison is done on strings (alphabetical order). In alphabetical order "3" comes after "10", and so does "7", which means no value can be both >= "7" and <= "10".
See this example:
age <- 1:15
sort(as.character(age))
You want Data_year$Competition.age>=3 and Data_year$Competition.age<=6 (numbers without quotes), and likewise 7 and 10 for the other range.
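To see the effect directly, here is a small sketch that reuses the age vector from the example above and compares the quoted and unquoted bounds. The variable names as_text and as_number are only illustrative.

```r
age <- 1:15

# Character comparison: the integers are coerced to strings first,
# so the test is done in alphabetical order
as_text <- age[age >= "7" & age <= "10"]

# Numeric comparison: no quotes around the bounds
as_number <- age[age >= 7 & age <= 10]

length(as_text)
as_number
```

The quoted version selects nothing, which matches the empty data frame in the question, while the unquoted version returns the ages 7 to 10.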

Copy columns of a data frame based on the value of a third column in R

I have a data frame with 4 columns. On one of the columns I added a date so that each value looks like this
>print(result[[4]][[10000]])
[[10000]]
[1] "Jan" "14" "2012"
That means that in the 10000th field of the 4th column I have these 3 fields. This is the only column that holds multiple values.
Now the other 3 columns of the data frame result are single values not multiple. One of those columns, the first one, has the states of the United States as values. What I want to do is create a new data frame from column 2 and 4 (the one described above) of the result data frame but depending on the state.
So for example I want all the 2nd column and 4th column data of the state of Alabama. I tried this but I don't think it is working properly. "levels" is the 2nd column and "weeks" is the 4th column of the data frame result.
rst <- subset(result, result$states == 'Alabama', select = c(result$levels, result$weeks))
The problem here is that subset is copying all the columns to rst and not just the second and fourth ones of the result data frame that are linked to Alabama state which are the only ones I want. Any idea how to do this correctly?
Edit to add the code
I'm adding the code here since I think there must be something I'm not seeing here. First a small sample of the original data which is on a csv file
st URL WEBSITE al aln wk WEEKSEASON
Alabama http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-04-2008 40 2008-09
Alabama http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-11-2008 41 2008-09
Alaska http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-18-2008 42 2008-09
Alaska http://adph.org/influenza/ Influenza Surveillance Level 1 Minimal Oct-25-2008 43 2008-09
And this is the code
#Extracts relevant data from the csv file
extract_data<-function(){
#open the file. NAME SHOULD BE CHANGED
sd <- read.csv(file="sdr.csv",head=TRUE,sep=",")
#Extracts the data from the ACTIVITY LEVEL column. Notice that the name of the column was changed on the file
#to 'al' to make the reference easier
lv_list <- sd$al
#Gets only the number from each value getting rid of the word "Level"
lvs <- lapply(strsplit(as.character(lv_list), " "), function(x) x[2])
#Gets the ACTIVITY LEVEL NAME. Column name was changed to 'aln' on the file
lvn_list <- sd$aln
#Gets the state. Column name was changed to 'st' on the file
st_list <- sd$st
#Gets the week. Column name was changed to 'wk' on the file
wk_list <- sd$wk
#Divides the weeks data in month, day, year
wks <- strsplit(as.character(wk_list), "-")
result<-list("states"=st_list,"levels"=lvs,"lvlnames"=lvn_list,"weeks"=wks)
return(result)
}
forecast<-function(){
result=extract_data()
rst <- subset(result, states == 'Alabama', select = c(levels, weeks))
return(0) #return results
}
You're nearly there, but you don't need to reference the dataframe in the select argument - this should work:
rst <- subset(result, states == 'Alabama', select = c(levels, weeks))
You could also look into the package dplyr, which gives you SQL like abilities and is great for manipulating larger and more complicated data sets.
EDIT
Thanks for posting your code - I think I've identified a few problems.
The result you return from extract_data() is a list, not a data.frame - which is why the code in forecast() doesn't work. If it did return a dataframe the original solution would work.
You're forming your list out of a combination of vectors and lists, which is a problem - a dataframe is (roughly) a list of vectors, not a collection of the two types. If you replace your list creation line with result <- data.frame(...) you run into problems because of this.
There are two problematic columns - lvs (or levels) and wks (weeks). Where you use lapply(), using sapply() instead would give you a vector, as required (see the manual). The second issue is the weeks column. What you're actually dealing with here is a list of character vectors of length 3. There's no easy way to do what you want - you can't, for example, have each 'cell' of a column in a dataframe contain a character vector, as the columns are themselves vectors.
My suggestions would be to either:
Use the original format "Oct-01-2008", i.e. construct your data.frame with wk_list rather than splitting each date into the three strings;
Convert the original format into a better time format with a package like lubridate (A+++++ would recommend, great package);
Or finally, split the week column into three columns, so you'd have one for month, one for day and one for year. You could do this very simply from wk_list like this:
wks <- sapply(strsplit(as.character(wk_list), "-"), function(x) c(x[1], x[2], x[3]))
Month <- wks[1,]
Day <- wks[2,]
Year <- wks[3,]
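Once lvs is a vector and the weeks are held in plain vectors (either the original wk_list, or Month, Day and Year as separate columns), you're good to build the data frame, for example with the original date strings:
result<-data.frame("states"=st_list,"levels"=lvs,"lvlnames"=lvn_list,"weeks"=wk_list)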
and the script should work.
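Putting the pieces together, a sketch of the third suggestion (separate month, day and year columns) might look like the following. The inline sd data frame is a hypothetical stand-in for sdr.csv with the column names st, al, aln and wk from the question; the names parts, month, day and year are chosen here for illustration.

```r
# Hypothetical sample standing in for sdr.csv
sd <- data.frame(
  st  = c("Alabama", "Alabama", "Alaska", "Alaska"),
  al  = rep("Level 1", 4),
  aln = rep("Minimal", 4),
  wk  = c("Oct-04-2008", "Oct-11-2008", "Oct-18-2008", "Oct-25-2008"),
  stringsAsFactors = FALSE
)

# sapply() simplifies to a character vector, where lapply() returns a list
lvs <- sapply(strsplit(as.character(sd$al), " "), function(x) x[2])

# Each date splits into three parts, so sapply() gives a 3 x n matrix
parts <- sapply(strsplit(as.character(sd$wk), "-"), function(x) x[1:3])

# Every column is now a plain vector, so data.frame() accepts them
result <- data.frame(
  states   = sd$st,
  levels   = lvs,
  lvlnames = sd$aln,
  month    = parts[1, ],
  day      = parts[2, ],
  year     = parts[3, ],
  stringsAsFactors = FALSE
)

# Column names in select are given bare, without result$
rst <- subset(result, states == "Alabama",
              select = c(levels, month, day, year))
rst
```

With a real data frame in place of the list, subset() keeps only the selected columns for the Alabama rows.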

Create a stack of n subset data frames from a single data frame based on date column

I need to create a bunch of subset data frames out of a single big df, based on a date column (e.g. - "Aug 2015" in month-Year format). It should be something similar to the subset function, except that the count of subset dfs to be formed should change dynamically depending upon the available values on date column
All the subsets data frames need to have similar structure, such that the date column value will be one and same for each and every subset df.
Suppose my big df currently holds the last 10 months of data: I need 10 subset data frames now, and 11 dfs if I run the same command next month (with 11 months of base data).
I have tried something like the code below, but after each iteration the subset subdf_i gets overwritten. So I end up with only one subset df, containing the last value of the month column.
I thought that would be created as 45 subset dfs like subdf_1, subdf_2,... and subdf_45 for all the 45 unique values of month column correspondingly.
uniqmnth <- unique(df$mnth)
for (i in 1:length(uniqmnth)){
subdf_i <- subset(df, mnth == uniqmnth[i])
i==i+1
}
I hope there should be some option in the subset function or any looping might do. I am a beginner in R, not sure how to arrive at this.
I think one solution is to use assign(), so that the loop index i is appended to the name of each of the 45 subsets. Thanks to my friend for the note. This stops the subset data frame being overwritten on each pass of the loop. The i==i+1 line is not needed: for advances i by itself, and == is a comparison, not an assignment.
uniqmnth <- unique(df$mnth)
for (i in seq_along(uniqmnth)){
assign(paste("subdf_",i,sep=""), subset(df, mnth == uniqmnth[i]))
}
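An alternative worth considering, not part of the original solution, is base R's split(), which returns all the subsets in one named list instead of creating separate variables. The sketch below uses a small made-up df with the mnth column from the question; the value column and the name subdfs are hypothetical.

```r
# Hypothetical data frame with a month column named mnth
df <- data.frame(
  mnth  = c("Jun 2015", "Jun 2015", "Jul 2015", "Aug 2015", "Aug 2015"),
  value = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# One data frame per distinct month, collected in a named list
subdfs <- split(df, df$mnth)

# Access a single subset by its month
subdfs[["Aug 2015"]]

# Apply the same operation to every subset
sapply(subdfs, nrow)
```

The number of list elements follows the number of distinct months automatically, so the same line works next month without changes.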

Iterating through rows with "NA" values to create a data frame

I basically have 14 years of data. Within each year, is anywhere from 100-300 observations of age.
I am trying to create a data frame of all of the ages in one column.
If I try
test=data.frame(vals[[1]]$age)
I get a data frame of all of the ages of year 1.
If I try
for (i in 1:length(survey$years)){test=data.frame(vals[[i]]$age)}
I get a data frame of the correct number of observations for all of the years, but all "NA" values.
There are "NA" values for some of the observations-- I'm assuming this is the problem, as when I try it with a variable with no NA values (length), it works correctly. How can I get around the blank values?
