I basically have 14 years of data. Within each year, there are anywhere from 100 to 300 observations of age.
I am trying to create a data frame of all of the ages in one column.
If I try
test=data.frame(vals[[1]]$age)
I get a data frame of all of the ages of year 1.
If I try
for (i in 1:length(survey$years)) {test=data.frame(vals[[i]]$age)}
I get a data frame of the correct number of observations for all of the years, but all "NA" values.
There are "NA" values for some of the observations-- I'm assuming this is the problem, as when I try it with a variable with no NA values (length), it works correctly. How can I get around the blank values?
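Note that the loop above reassigns `test` on every pass, so at best it holds the last year rather than all years. A minimal sketch of stacking every year into one column, assuming `vals` is a list with one element per year and each element has an `age` vector (the toy data below is made up):

```r
# Hypothetical stand-in for `vals`: a list with one data frame per year.
vals <- list(
  data.frame(age = c(3, NA, 5)),
  data.frame(age = c(7, 2)),
  data.frame(age = c(NA, 4, 6, 1))
)

# Build one small data frame per year, keeping the year index next to each age.
ages <- lapply(seq_along(vals), function(i) {
  data.frame(year = i, age = vals[[i]]$age)
})

# Stack the per-year pieces into a single data frame.
test <- do.call(rbind, ages)

# NA ages are carried through as NA; drop them only if that is what you want.
test_complete <- test[!is.na(test$age), ]
```

Missing ages do not stop the rows from being combined, so there is no need to work around them at this stage.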
I'm trying to replace missing values in R with the value that follows. I have annual income data by country, and for a missing income value, say 2001 for country A, I want to pull in the next year's value. (This is for time series analysis with multiple countries and a separate column for each variable; income is just one of them.)
I wrote the code below to replace the missing values with the mean, but statistically I think it makes more sense to use the value right below (the next year's), since the numbers differ widely by country and the mean would be taken over all years for all countries.
Social_data_R<-within(Social_data_R,incomeNAavg[is.na(income)]<-mean(income,na.rm=TRUE))
I tried replacing the mean part of the code above with income[i+1], but R didn't recognize 'i'. (I imported the data from Excel, so I didn't create the data frame manually.)
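One way to fill each gap with the next available value, per country, in base R is sketched below. The helper `fill_next` and the `country` and `year` column names are assumptions made for illustration; only `Social_data_R` and `income` come from the question.

```r
# Hypothetical toy data: two countries, yearly income with gaps.
Social_data_R <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(2000:2003, times = 2),
  income  = c(10, NA, 12, 13, 100, 110, NA, NA)
)

# Replace each NA with the next non-missing value further down the vector.
# Trailing NAs stay NA because there is nothing after them to copy.
fill_next <- function(x) {
  if (length(x) < 2) return(x)
  for (k in rev(seq_along(x))[-1]) {
    if (is.na(x[k])) x[k] <- x[k + 1]
  }
  x
}

# Rows must be ordered by year within country before filling.
Social_data_R <- Social_data_R[order(Social_data_R$country, Social_data_R$year), ]

# ave() applies fill_next separately within each country.
Social_data_R$income_filled <- ave(
  Social_data_R$income, Social_data_R$country, FUN = fill_next
)
```

Grouping by country keeps a value from one country from leaking into another. If an extra package is acceptable, `zoo::na.locf()` with `fromLast = TRUE` does the same kind of backward fill.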
I am trying to run the corr.test function in a for loop between a range of columns in a data frame against the rest of the columns in the same data frame. However, I have a lot of NA values throughout this data frame. I don't want to omit the rows altogether and lose the rest of the data in the rows and I also don't want to set NA = 0 because it will interfere with the rest of the data (scores that are either -1, 1, or 0). Every time I try to run the corr.test function, R keeps saying that x or y are not numeric vectors.
Is there any way to get around this?
The first column (rownames) of my data frame is a list of sample IDs, columns 2-50 are scores, and columns 51 onward are scores of a different type. What I've been doing so far is using a for loop to run corr.test between each range of columns, like this example:
cor.test(data[1:50], data[51:200])
This works fine in the for loop if I convert NA values to 0 but is there any way to avoid doing that?
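`cor.test()` expects two numeric vectors, so passing whole blocks of columns is a likely source of the "not numeric" error. A minimal sketch that leaves the NA values in place, using toy data whose column names are made up for illustration: `cor()` with `use = "pairwise.complete.obs"` gives the coefficients, and `cor.test()` on one column pair at a time gives the p-values.

```r
# Hypothetical toy data: an ID column followed by scores in {-1, 0, 1} with NAs.
data <- data.frame(
  id = paste0("s", 1:30),
  a1 = rep(c(-1, 0, 1, NA, 1, 0), 5),
  a2 = rep(c(1, NA, -1, 0, 0), 6),
  b1 = rep(c(0, 1, NA), 10),
  b2 = rep(c(1, 1, -1, 0, NA, 0, -1, 1, 0, -1), 3)
)

# Leave the ID column out so that only numeric columns are correlated.
first_block  <- data[, c("a1", "a2")]
second_block <- data[, c("b1", "b2")]

# For each pair of columns, use only the rows where both values are present.
r <- cor(first_block, second_block, use = "pairwise.complete.obs")

# cor.test() takes two vectors and drops incomplete pairs itself,
# so loop over the column pairs to collect p-values.
p <- sapply(second_block, function(y) {
  sapply(first_block, function(x) cor.test(x, y)$p.value)
})
```

Because each pair uses its own set of complete rows, no row is discarded wholesale and nothing is recoded to 0.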
I am trying to remove some outliers from my data set. I am investigating each variable in the data one at a time. I have constructed boxplots for variables but don't want to remove all the classified outliers, only the most extreme. So I am noting the value on the boxplot that I don't want my variable to exceed and trying to remove rows that correspond to the observations that have a specific column value that exceed the chosen value.
For example,
My data set is called milk and one of the variables is called alpha_s1_casein. I thought the following would remove all rows in the data set where the value for alpha_s1_casein is greater than 29:
milk <- milk[milk$alpha_s1_casein < 29,]
In fact it did. The number of rows in the data frame decreased from 430 to 428. However, it has introduced a lot of NA values in unrelated columns of my data set.
Before I ran the above code, the number of NAs was
sum(is.na(milk))
5909 NA values
But after running the above, the number of NAs returned is
sum(is.na(milk))
75912 NA values.
I don't understand what is going wrong here and why what I'm doing is introducing more NA values than when I started when all I'm trying to do is remove observations if a column value exceeds a certain number.
Can anyone help? I'm desperate
Without using additional packages, to remove all rows in the data set where the value for alpha_s1_casein is greater than 29, you can just do this:
milk <- milk[-which(milk$alpha_s1_casein > 29),]
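The extra NA values most likely come from rows where alpha_s1_casein is itself NA: the comparison returns NA for those rows, and indexing a data frame with a logical NA produces a row that is NA in every column. A minimal sketch with a made-up two-column version of `milk` (the `fat` column is hypothetical):

```r
# Hypothetical toy version of `milk` with a missing value in the filter column.
milk <- data.frame(
  alpha_s1_casein = c(20, 35, NA, 25),
  fat             = c(3.1, 3.5, 3.8, 4.0)
)

# Logical indexing: the NA comparison turns that row into an all-NA row.
bad <- milk[milk$alpha_s1_casein < 29, ]

# which() ignores NA comparisons, so rows with a missing value are dropped too.
kept_known <- milk[which(milk$alpha_s1_casein < 29), ]

# Keep rows that are missing in the filter column; drop only values above 29.
kept_all <- milk[is.na(milk$alpha_s1_casein) | milk$alpha_s1_casein <= 29, ]
```

One caveat with the `-which(...)` form: if no value exceeds the threshold, `which()` returns an empty vector and the negative index selects zero rows, so the positive `which()` or the explicit `is.na()` version is the safer choice.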
My initial dataset was a csv containing information about the number of bikes that were rented in a certain city with other variables being temperature,season, etc...
I was creating a subset based on conditionals to get a set that would have seasons be "3" or "4" and annee be "1". I tried the following:
P<- subset(velo,saison>2&annee==1)
I also tried
W<- velo[which(velo$annee==1 & velo$saison>2),]
Both returned the same subset: 183 observations of 5 variables.
I then wanted to summarise the data through
summary(W$velos[saison==3])
summary(W$velos[saison==4])
It gives me the following outputs
In the data set I can see that the column season is not full of NaN and doing the class() returns integer for that column.
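The issue was that saison was referenced without the data frame it belongs to, so R did not look for it inside W. Extracting the column from W fixes it: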
summary(W$velos[W$saison==3])
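Inside the square brackets a bare `saison` is looked up in the calling environment, not in `W`. A short sketch on made-up data showing two ways to avoid repeating `W$`, and to get one summary per season in a single call:

```r
# Hypothetical toy version of the subset `W`.
W <- data.frame(
  saison = c(3, 3, 4, 4, 4),
  velos  = c(120, 150, 80, 95, 110)
)

# One summary per season, without writing the filter once per value.
by_season <- tapply(W$velos, W$saison, summary)

# with() evaluates the expression inside W, so bare column names work here.
season3 <- with(W, summary(velos[saison == 3]))
```

`by_season[["3"]]` and `by_season[["4"]]` then hold the two summaries.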
I need to create a set of subset data frames from a single big df, based on a date column (e.g. "Aug 2015", in month-year format). It should work like the subset function, except that the number of subset dfs should change dynamically with the values available in the date column.
All the subset data frames need to have the same structure, and within each subset df the date column should hold a single value.
For example, if my big df currently has the last 10 months of data, I need 10 subset data frames now, and 11 dfs if I run the same command next month (with 11 months of base data).
I have tried something like the code below, but on each iteration the subset subdf_i gets overwritten. So I end up with only one subset df, which holds the last value of the month column.
I expected it to create 45 subset dfs, subdf_1, subdf_2, ... subdf_45, one for each of the 45 unique values of the month column.
uniqmnth <- unique(df$mnth)
for (i in 1:length(uniqmnth)){
subdf_i <- subset(df, mnth == uniqmnth[i])
i==i+1
}
I hope there should be some option in the subset function or any looping might do. I am a beginner in R, not sure how to arrive at this.
I think the right solution here is to use assign(), so that the loop index i is appended to the name of each of the 45 subsets. Thanks to my friend for the note. Here is the solution, which stops the subset data frame from being overwritten on each pass of the loop.
uniqmnth <- unique(df$mnth)
for (i in 1:length(uniqmnth)){
# The for loop advances i by itself, so no manual increment is needed.
assign(paste("subdf_",i,sep=""), subset(df, mnth == uniqmnth[i]))
}
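An alternative worth considering, sketched here on made-up data, is `split()`, which returns all the subsets in one named list instead of creating a separate variable for each month. Only the `df` and `mnth` names come from the question; the `value` column is hypothetical.

```r
# Hypothetical toy data with a month column in "Mon YYYY" format.
df <- data.frame(
  mnth  = c("Aug 2015", "Sep 2015", "Aug 2015", "Oct 2015", "Sep 2015"),
  value = c(1, 2, 3, 4, 5)
)

# One data frame per unique month, stored in a list named by month.
subdfs <- split(df, df$mnth)

# Access a single month by name, or work on all of them at once.
aug <- subdfs[["Aug 2015"]]
rows_per_month <- sapply(subdfs, nrow)
```

The list grows automatically when a new month appears in the data, and it can be passed to `lapply()` or `sapply()` without having to rebuild variable names with `paste()`.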