I am calculating returns in R and trying to add them to the dataframe I am working with, but it doesn't work because the row counts differ: the existing dataframe has 194 rows, while the assigned data has 193 rows.
This code works just fine when doing it on its own:
diff(log(capm$price_Ford))
But when I try to assign it to the dataframe as its own column, I get an error:
capm$ford_ret <- diff(log(capm$price_Ford))
How can I assign the data with 193 rows, to a dataframe with 194 rows?
In a nutshell, you can’t. Each column in a table must have the same number of rows. You need to decide what to fill into the row that’s missing a value. Depending on your use-case, this might for example be 0 or NA. You also need to decide whether the missing value should go at the beginning or at the end (for a difference, usually at the beginning). For example:
capm$ford_ret <- c(NA, diff(log(capm$price_Ford)))
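To see why the padding is needed, here is a minimal sketch on a made-up four-row data frame (the prices are invented for illustration; only the column names come from the question). diff() always returns one element fewer than its input, so prepending NA restores the original length.

```r
# Toy data: four invented prices, standing in for the real capm data frame.
capm <- data.frame(price_Ford = c(10, 11, 12.1, 11.5))

# diff() returns n - 1 values, one per pair of consecutive rows.
raw_ret <- diff(log(capm$price_Ford))
length(raw_ret)   # one fewer than nrow(capm)

# Prepend NA so the first row (which has no previous price) is marked missing.
capm$ford_ret <- c(NA, raw_ret)
capm
```

Putting the NA first keeps each return on the row of the later of the two prices it was computed from.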
I am trying to remove some outliers from my data set. I am investigating each variable in the data one at a time. I have constructed boxplots for variables but don't want to remove all the classified outliers, only the most extreme. So I am noting the value on the boxplot that I don't want my variable to exceed and trying to remove rows that correspond to the observations that have a specific column value that exceed the chosen value.
For example,
My data set is called milk and one of the variables is called alpha_s1_casein. I thought the following would remove all rows in the data set where the value for alpha_s1_casein is greater than 29:
milk <- milk[milk$alpha_s1_casein < 29,]
In fact it did. The number of rows in the data frame decreased from 430 to 428. However, it has introduced a lot of NA values in unrelated columns of my data set.
Before I ran the above code, the number of NAs was
sum(is.na(milk))
5909 NA values
But after performing the above the sum of NA's now returned is
sum(is.na(milk))
75912 NA values.
I don't understand what is going wrong here and why what I'm doing is introducing more NA values than when I started when all I'm trying to do is remove observations if a column value exceeds a certain number.
Can anyone help? I'm desperate
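The extra NA values most likely appear because alpha_s1_casein itself contains NA: wherever a logical row index is NA, R returns a row filled entirely with NA. which() drops those NA positions, so without using additional packages, to remove all rows in the data set where the value for alpha_s1_casein is greater than 29, you can just do this: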
milk <- milk[-which(milk$alpha_s1_casein > 29),]
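A minimal sketch of what is probably going on, using a made-up four-row stand-in for milk (the fat column and all values are invented). It also shows an alternative that avoids one known pitfall of -which(): if no value exceeds the threshold, which() returns an empty vector and the negative index selects zero rows.

```r
# Toy stand-in for the milk data; one alpha_s1_casein value is missing.
milk <- data.frame(alpha_s1_casein = c(20, NA, 31, 25),
                   fat             = c(3.1, 3.4, 3.9, 3.2))

# An NA in the logical index yields a row that is NA in every column.
bad <- milk[milk$alpha_s1_casein < 29, ]
sum(is.na(bad))    # more NAs than the rows we kept originally had

# Build the index explicitly: keep rows that are missing or within the limit.
keep <- is.na(milk$alpha_s1_casein) | milk$alpha_s1_casein <= 29
good <- milk[keep, ]
sum(is.na(good))   # only the NA that was already in the data
```

Whether rows with a missing alpha_s1_casein should be kept is a judgment call. To drop them too, use !is.na(...) & ... <= 29 as the index instead.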
I'm having some trouble randomly sampling 1 column out of a group. I have over 300 columns, and over 500 rows. I am attempting to sample 1 column out of the first 15, and then move on to sample 1 column from the next 15, etc... until there are no more.
For the basic first sample, I used:
sample(DATA[,1:15],1)
But it only outputs a single number. If I change my size to 535 (the number of rows), it grabs 535 random numbers in total from columns 1:15.
I referenced the link below, which covers a somewhat similar problem, but the accepted answer is what I tried and I can't seem to get it to work:
R: random sample of columns excluding one column
Any suggestions?
sample(1:15, 1) returns a single integer. Use it to pick the column index rather than passing the whole data frame to sample(), as you did earlier.
DATA[,sample(1:15,1)]
This selects one column at random from columns 1 to 15 and returns it, as you wanted.
Found my answer pretty quickly:
DATA[,sample(1:15,1)]
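To repeat this for every block of 15 columns, something along these lines might work. It is a sketch on simulated data (45 random columns standing in for the real 300+); starts, picked and sampled are names made up for the example.

```r
set.seed(1)  # only so the toy example is reproducible

# Simulated stand-in for DATA: 500 rows, 45 columns (three blocks of 15).
DATA <- as.data.frame(matrix(rnorm(500 * 45), nrow = 500))

# First column index of each block: 1, 16, 31, ...
starts <- seq.int(1L, ncol(DATA), by = 15L)

# Pick one column index per block; the last block may be shorter than 15.
picked <- vapply(starts, function(s) {
  block <- s:min(s + 14L, ncol(DATA))
  # sample.int() on the block length avoids sample()'s special handling
  # of a length-one vector if the final block has a single column.
  block[sample.int(length(block), 1)]
}, integer(1))

# drop = FALSE keeps the result a data frame even if only one column is picked.
sampled <- DATA[, picked, drop = FALSE]
dim(sampled)
```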
I have several text files, each containing 2 columns but a different number of rows. I would like to draw a plot using ggplot2 as explained in the linked post; however, that approach works well for dataframes with equal row numbers, and I couldn't reproduce it with dataframes that have different row numbers.
Please let me know how I should combine these data frames (dataframes with different row numbers) using R.
case size
case1 129
case2 129
case3 130
case4 131
case5 132
case6 132
Thank you
It seems from the comments that you're actually trying to merge multiple columns and then plot each column individually. The problem, however, is that each of these columns has a different number of rows. Therefore you need to combine them based on some common variable (i.e. row names).
Using the examples from the link you provided:
df1 = data.frame(size=runif(300,300,1200))
#now adding an unequal column
df2 = data.frame(size=df1[c(1:275),])
Now merge the data frames based on row number. "all=TRUE" keeps all the values, "by=0" merges by row.names.
df.all=merge(df1$size,df2$size,by=0,all=TRUE)
#and to order the row names.
df.all=df.all[order(as.numeric(df.all[,1])),]
#finally if you want to replace the NA values with 0
df.all[is.na(df.all)]=0
Does that get you the data.frame you want?
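For reference, here is the same row-name merge on two tiny made-up data frames, small enough to inspect by eye. This sketch passes the data frames themselves rather than their size vectors, so merge() names the value columns size.x and size.y.

```r
# Two invented data frames with 3 and 2 rows.
df1 <- data.frame(size = c(300, 450, 600))
df2 <- data.frame(size = c(310, 470))

# by = 0 merges on row names; all = TRUE keeps rows missing from either side.
df.all <- merge(df1, df2, by = 0, all = TRUE)

# Row names come back as text, so sort them numerically.
df.all <- df.all[order(as.numeric(df.all$Row.names)), ]
df.all
```

Row 3 exists only in df1, so its size.y is NA. Whether to leave that NA or replace it with 0 depends on the plot: ggplot2 can skip missing values, whereas a 0 would be drawn as a real observation.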
I'm attempting to create a counter variable in R which loops through the n rows of my 442 column dataframe and increases the counter by 1 on every 55th row.
I've tried the following code:
dataset$num=ceiling(row(dataset)/55)
which works fine, however R duplicates the function for every column in my dataframe rather than simply creating a single new column containing the counter variable. So I have 442 copies of the same variable titled num.1, num.2, ..., num.442.
What am I doing wrong? Thanks!
It sounds like you just want something like:
rep(1:1000,each=55,length.out=nrow(dataset))
The 1000 here could be anything as long as it's larger than nrow(dataset)/55.
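As for why the original attempt created 442 columns: row(dataset) returns a whole matrix of row indices with the same dimensions as the data frame, not a single vector. A sketch of an alternative that keeps the ceiling() idea but feeds it one vector of row numbers (the 120-row data frame is invented for the example):

```r
# Toy data frame with 120 rows standing in for the real one.
dataset <- data.frame(x = seq_len(120))

# row() gives a matrix with one column per column of the data frame.
dim(row(dataset))

# seq_len(nrow()) gives a single vector 1, 2, ..., n instead.
dataset$num <- ceiling(seq_len(nrow(dataset)) / 55)
table(dataset$num)
```

This gives the same counter as the rep() call above, without having to choose an upper bound such as 1000.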
I need to extract the columns from a dataset without header names.
I have a ~10000 x 3 data set and I need to plot the first column against the second two.
I know how to do it when the columns have names ~ plot(data$V1, data$V2) but in this case they do not. How do I access each column individually when they do not have names?
Thanks
Why not give them sensible names?
names(data)=c("This","That","Other")
plot(data$This,data$That)
That's a better solution than using the column number, since names are meaningful, and code that relies on column numbers may break in several places if your data changes to have a different number of columns. Give your data the correct names and as long as you always refer to data$This then your code will work.
I usually select columns by their position in the matrix/data frame.
e.g.
dataset[,4] to select the 4th column.
The 1st number in brackets refers to rows, the second to columns. Here, I didn't use a "1st number" so all rows of column 4 are selected, i.e., the whole column.
This is easy to remember since it stems from matrix calculations. E.g., a 4x3 dimensional matrix has 4 rows and 3 columns. Thus when I want to select the 1st row of the third column, I could do something like matrix[1,3]
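Applied to the original question, positional indexing could look like the sketch below. The data frame is a tiny invented stand-in for the ~10000 x 3 data set, and matplot() is just one way to draw two columns against a third.

```r
# Toy stand-in for the real data (values are made up).
data <- data.frame(1:5, (1:5)^2, (1:5)^3)

x  <- data[[1]]     # first column as a plain vector
ys <- data[, 2:3]   # second and third columns, still a data frame

# matplot() draws each column of the matrix against x.
matplot(x, as.matrix(ys), type = "l",
        xlab = "column 1", ylab = "columns 2 and 3")
```

data[[1]] and data[, 1] both return the first column as a vector, and neither depends on what the column happens to be called.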