So I have data imported into R using data = read.delim("clipboard").
This is the last section of the data, so I decided to use data2 = na.omit(data, method = "linear"), which gave me this result...
But as you can see, I have lost the data from rows 290 to 293 for the 3rd and 4th columns. Please help me remove those NA values without losing data from the other columns. The data I have given you represent time and speed, and what I'm trying to do is find the average speed every 100 s, using code pointed out to me in my previous questions, which is in this link...h
Keep the NA values as they are, but use na.rm in your subsequent manipulations; e.g., sum(df[, 1], na.rm = TRUE), where df is your data frame.
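For the average-speed-every-100-s part, a minimal sketch of that idea could look like this (I'm assuming the two columns are named time and speed; substitute your actual column names):

# read the clipboard data as before
data <- read.delim("clipboard")

# bin the time column into 100 s intervals
bins <- cut(data$time,
            breaks = seq(0, max(data$time, na.rm = TRUE) + 100, by = 100),
            include.lowest = TRUE)

# average speed within each bin, skipping NAs instead of dropping whole rows
avg_speed <- tapply(data$speed, bins, mean, na.rm = TRUE)
avg_speed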
I am trying to remove some outliers from my data set, investigating each variable in the data one at a time. I have constructed boxplots for the variables but don't want to remove all of the flagged outliers, only the most extreme ones. So I am noting the value on the boxplot that I don't want the variable to exceed and trying to remove the rows whose value in that column exceeds the chosen value.
For example,
My data set is called milk and one of the variables is called alpha_s1_casein. I thought the following would remove all rows in the data set where the value for alpha_s1_casein is greater than 29:
milk <- milk[milk$alpha_s1_casein < 29,]
In fact it did: the number of rows in the data frame decreased from 430 to 428. However, it has introduced a lot of NA values in uninvolved columns of my data set.
Before I ran the above code, the number of NAs was:
sum(is.na(milk))
5909 NA values
But after running the above, the number of NAs returned is:
sum(is.na(milk))
75912 NA values.
I don't understand what is going wrong here, or why this introduces more NA values than I started with, when all I'm trying to do is remove observations whose value in one column exceeds a certain number.
Can anyone help? I'm desperate
Without using additional packages, to remove all rows in the data set where the value for alpha_s1_casein is greater than 29, you can just do this:
milk <- milk[-which(milk$alpha_s1_casein > 29),]
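For what it's worth, the extra NAs come from the rows where alpha_s1_casein is itself NA: the comparison milk$alpha_s1_casein < 29 evaluates to NA for those rows, and indexing a data frame with an NA in a logical index returns a row filled entirely with NAs. A sketch of an equivalent fix that keeps those rows (as the -which() version does) while avoiding the NA index:

# keep rows where the value is at most 29, or where it is missing
keep <- is.na(milk$alpha_s1_casein) | milk$alpha_s1_casein <= 29
milk <- milk[keep, ]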
I have a data frame of size 1379 x 843 such that the rows are the daily prices, and the columns are the securities.
I want to calculate returns and subset these returns based on a drop of 30% in a day, but I am having trouble dealing with the large number of NA values.
How do most of you go about dealing with NA values, especially given the case which I have described?
Never mind, I figured it out. Just using the functions contained in PerformanceAnalytics worked; I hadn't checked the output carefully enough.
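For anyone hitting the same thing, a base-R sketch of the calculation might look like the following; the object names are placeholders, and it assumes prices holds the 1379 x 843 price data with rows in date order:

prices <- as.matrix(prices)                             # hypothetical price matrix, rows = days
returns <- prices[-1, ] / prices[-nrow(prices), ] - 1   # simple daily returns; NAs just propagate
drops <- which(returns < -0.30, arr.ind = TRUE)         # which() silently skips the NA comparisons
drops                                                   # row/column positions of the 30% daily drops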
I am new to R and really trying to wrap my head around everything (even taking an online course, which so far has not helped at all).
What I started with is a large data frame containing 97 variables pertaining to compliance with regulations.
I have created multiple dataframes based on the various geographic locations (there is probably an easier way to do it).
In each of these data frames, I have 7 variables for which I would like to find the mean of the "Yes" and "No" responses (i.e., their proportions).
I first tried:
summary(urban$vio_bag)
Length Class Mode
398 character character
However, this just tells me nothing useful except that I have 398 responses.
So I put this into a table:
urbanbag<-table(urban$vio_bag)
This at least provided me with the number of Yes and No responses
Var1 Freq
1 No 365
2 Yes 30
So I then converted to a data.frame:
urbanbag = as.data.frame(urbanbag)
Then viewed it:
summary(urbanbag)
Var1 Freq
No :1 Min. : 30.0
Yes:1 1st Qu.:113.8
Median :197.5
Mean :197.5
3rd Qu.:281.2
Max. :365.0
And the output still did not help; it was actually even less useful.
I am not building these matrices in R; it is a table imported from Excel.
I am just so lost and frustrated, having spent days trying to figure out something that seems so elementary and googling for help that did not work out.
Is there a way to actually do this?
We can use prop.table to get the proportion
v1 <- prop.table(table(urban$vio_bag))
then use barplot to plot it
barplot(v1)
Try dplyr's n() (which performs counts) within summarise():
library(dplyr)
data %>% group_by(yes_no_column) %>% summarise(my_counts = n())
This will give you the counts you're looking for. Adjust the group_by() variables as needed; multiple variables can be used at a time for grouping purposes. Just as with n(), functions such as mean and sd can be passed to summarise(). If you want to make a column out of each calculated metric, use mutate().
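For instance, a sketch of the proportion calculation for this question's data (the grouping column location is a placeholder; vio_bag comes from the question):

data %>%
  group_by(location) %>%   # hypothetical grouping column, e.g. the geographic area
  summarise(prop_yes = mean(vio_bag == "Yes", na.rm = TRUE))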
Oscar.
prop.table is a useful way of doing this. You can also solve this using mean:
mean(urban$vio_bag == "Yes")
mean(urban$vio_bag == "No")
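To cover all 7 variables in one go, something along these lines could work (a sketch; the vector of column names is a placeholder for your actual Yes/No variables):

yes_no_cols <- c("vio_bag", "vio_straw")   # placeholder names for the 7 Yes/No columns
sapply(urban[yes_no_cols], function(x) mean(x == "Yes", na.rm = TRUE))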
I want to aggregate my data set and count how often a particular kind of disease occurs on a given date. (I don't use duplicated() because I want all rows, not only the duplicated ones.)
My original data set looks like:
id dat kinds kind
AE00302 2011-11-20 valv 1
AE00302 2011-10-31 vask 2
(of course my data.frame is much larger)
I try this:
xagg <- aggregate(kind ~ id + dat + kinds, data = subx, FUN = length)
names(xagg) <- c("id", "dat", "kinds", "kindn")
and get:
id dat kinds kindn
AE00302 2011-10-31 valv 1
AE00302 2011-11-20 vask 1
I wonder why R goes wrong with the 'dat' and 'kinds' columns.
Has anybody an idea?
I still don't know why.
But I found out that aggregate goes wrong because of columns I don't use for aggregating.
Therefore, these steps solved the problem for me (a code sketch follows the list):
# 1st step: reduce the data.frame to only the needed columns
# 2nd Step: aggregate the reduced data.frame
# 3rd Step: merge aggregated data to reduced dataset
# 4th step: remove duplicated rows from reduced dataset (if they occur)
# 5th step: merge reduced dataset without duplicated data back to original dataset
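A rough sketch of what those steps might look like with the example columns above (this is illustrative, not my exact code):

# 1. reduce to the needed columns
subx_red <- subx[, c("id", "dat", "kinds", "kind")]

# 2. aggregate the reduced data.frame
xagg <- aggregate(kind ~ id + dat + kinds, data = subx_red, FUN = length)
names(xagg)[4] <- "kindn"

# 3. merge the aggregated counts back onto the reduced data
subx_red <- merge(subx_red, xagg, by = c("id", "dat", "kinds"))

# 4. remove duplicated rows, if they occur
subx_red <- subx_red[!duplicated(subx_red), ]

# 5. merge the reduced data back onto the original dataset
subx_final <- merge(subx, subx_red, by = c("id", "dat", "kinds", "kind"))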
Maybe the problem occurs if there are duplicated rows in the aggregated data.frame.
Thanks for all your help, questions and attempts to solve my problem!
elchvonoslo
I am trying to subset a data frame based on a range of time. Someone has asked this question in the past and the answer was to use R CMD INSTALL lubridate_1.3.1.tar.gz (see link: subset rows according to a range of time).
The issue with this answer is that I get the following warning:
> install.packages("lubridate_1.3.2.tar.gz")
Warning in install.packages :
package ‘lubridate_1.3.2.tar.gz’ is not available (for R version 3.1.2)
I am looking for something very similar to this answer but I cannot figure out how to do this. I have a MasterTable with all of my data organized into columns. One of my columns is called maxNormalizedRFU.
My question is simple:
How can I subset my maxNormalizedRFU column by time?
I would simply like to add another column which only displays the maxNormalizedRFU data between 10 hours and 14 hours. Here is what I have so far:
#Creates the master table
MasterTable <- inner_join(LongRFU, LongOD, by= c("Time.h", "Well", "Conc.nM", "Assay"))
#normalizes my data by fluorescence (RFU) and optical density (OD) based on 6 different subsets called "Assay"
MasterTable$NormalizedRFU <- MasterTable$AvgRFU/MasterTable$AvgOD
#creates a column that only picks the maximum value of each "Assay"
MasterTable <- ddply(MasterTable, .(Conc.nM, Assay), transform, maxNormalizedRFU=max(NormalizedRFU))
#The issue
MasterTable$CutmaxNormalizedRFU <- ddply(maxNormalizedRFU, "Time.h", transform, [MasterTable$Time.h < 23.00 & MasterTable$Time.h > 10.00,])
Attached is a sample of my dataset. Since the original file has over 90 000 lines, I have only attached a small fraction of it (only one assay and one concentration).
My line is currently using ddply to do the subset but this simply does not work. Does anyone have a suggestion as to how to fix this issue?
Thank you in advance!
Marty
I downloaded your data and had a look. If I am not mistaken, all you need is to subset the data using Time.h. Here you want a time range of 10-23. I used dplyr and did the following, which asks R to pick out the rows whose Time.h values lie between 10 and 23. Your data frame is called mydf here.
library(dplyr)
filter(mydf, between(Time.h, 10, 23))
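If the goal is the extra column from the question (values kept only inside the time window, NA elsewhere) rather than a filtered table, a sketch along the same lines would be (using the 10-14 hour window mentioned in the question; switch to 10-23 if that is the intended range):

library(dplyr)
MasterTable <- MasterTable %>%
  mutate(CutmaxNormalizedRFU = ifelse(between(Time.h, 10, 14),
                                      maxNormalizedRFU, NA))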