I have a dataset in which the data would look something like this:
(image: a fragment of the data frame with the data)
So lots of NAs per row, but also regular answers that I want in the final version.
Is it possible to remove the NAs, but without removing the rows as a whole?
I thought about pivoting and removing rows with NA, but then it would remove the occurrences that have actual answers as well.
The data comes from a decision-making procedure in Qualtrics, in which not every option is displayed to every participant (hence the NAs), but we do not want to exclude people at any step. I also thought about maybe recoding the values and subsetting them somehow, but that doesn't seem to work out right in my mind when it comes to the actual analysis.
I tried removing the NAs, as well as pivoting the table and removing them later.
I do not yet have the full dataset, but I want to experiment with data-analysis strategies before the data is collected, so as not to get lost once I have it.
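In case it helps while experimenting: one way to drop the NA cells without dropping respondents is to pivot long, remove only the NA rows there, number each respondent's remaining answers, and pivot back wide. A minimal sketch with tidyr and dplyr, where the data frame and all of its column names are invented stand-ins for the Qualtrics export:

library(dplyr)
library(tidyr)

# toy stand-in: each respondent saw only some options, so the
# unseen options are NA (all names here are made up)
df <- tibble(
  id      = 1:3,
  optionA = c("yes", NA,      "no"),
  optionB = c(NA,    "maybe", NA),
  optionC = c("no",  "yes",   NA)
)

df %>%
  pivot_longer(-id, names_to = "option", values_to = "answer") %>%
  drop_na(answer) %>%                        # drops NA cells, not respondents
  group_by(id) %>%
  mutate(slot = paste0("answer", row_number())) %>%
  ungroup() %>%
  pivot_wider(id_cols = id, names_from = slot, values_from = answer)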
This might be a stupid question. I have some NAs in a matrix that I need to pass into a JAGS model, but I want to remove those NAs. Can I remove only the NAs but keep the rest of the data?
My data looks like the picture below. Can I have rows with different numbers of columns?
You cannot.
You need to impute these missing values or remove either the column or the row entirely.
Imputing missing values is as complicated as you want it to be. You'd be best off looking at the first few Google results on the topic, or just using the mean value of the column.
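A minimal sketch of the column-mean route, assuming a plain numeric matrix (JAGS needs a rectangular matrix, so the NAs have to be filled rather than dropped):

m <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3)

# replace each NA with the mean of its column
for (j in seq_len(ncol(m))) {
  m[is.na(m[, j]), j] <- mean(m[, j], na.rm = TRUE)
}
m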
I couldn't find the answer in any previously asked questions, but I believe this is an easy one.
I have the below two lines of code, which read data from Excel over a specific range (using readxl for this). The range itself only goes through row 2589 in the Excel document, but it will update dynamically (it's a time series), and to ensure I capture new observations (rows) as they're added, I've included rows up to 10000 in the read_excel range argument.
In the end, I'd like to run charts on this data, but a key part of that is identifying the last used row without manually updating the row number in the code for each latest date. I've tried using nrow, but to no avail.
library(readxl)

Raw_Index_History <- read_excel("RData.xlsx", range = "Returns!A6:P10000", col_names = TRUE)
# keeps only the final row of the padded range, which is empty, not the latest observation
Raw_Index_History <- Raw_Index_History[nrow(Raw_Index_History), ]
Does anybody have any thoughts or advice? Thanks very much.
It would be easier to answer your question if you included an example.
Without knowing what your data looks like, answers are likely going to be a bit vague.
Does your data contain NAs? If not, it should be straightforward to remove the empty rows with
na.omit(Raw_Index_History)
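Note that na.omit() drops any row containing at least one NA. If partially filled rows should survive, a sketch that removes only the rows that are entirely empty:

# keep rows with at least one non-NA cell
Raw_Index_History <- Raw_Index_History[rowSums(!is.na(Raw_Index_History)) > 0, ]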
It appears you also have control over the Excel spreadsheet. So, in case your data does contain NAs, you could put a placeholder value in your empty rows that will get overwritten as soon as a new data point is recorded. This will allow you to filter your data frame accordingly.
Raw_Index_History[!grepl("place_holder", Raw_Index_History$column_with_placeholder),]
If you expect data in the spreadsheet to grow, you can specify only the columns to include, instead of a defined boundary.
Something like this ...
Raw_Index_History <- read_excel("RData.xlsx",
sheet = 1,
range = cell_cols("A:P"), # Only cols, no rows
col_names = TRUE)
Every time you run the code, R will pull in the data from columns A:P up to the last populated row.
This is a more elegant approach for your use case. (Consider what you'd do when your data crosses 10,000 rows in the future.)
I hope this has not been answered, but when I search for a solution to my problem I am not getting any results.
I have a data.frame of 2000+ observations and 20+ columns. Each row represents a different observation and each column represents a different facet of data for that observation. My objective is to iterate through the data.frame and select the observations that match certain criteria (e.g., observations that are in certain states). After this, I need to subtract or add time to convert each one to its appropriate time zone (all of the times are in CST). What I have so far is an exorbitant number of subsetting commands that pick out the rows matching the state being checked against. When I try to write a for loop, I can only get one value returned, not the whole row.
I was wondering if anyone had any suggestions or knew of any functions that could help. I've tried just about everything, but I really don't want to have to go through each state's observations and modify the times by hand. I would prefer a loop that could easily go through the data, select rows based on their state, subtract or add time, and then place the rows back into the original data.frame (replacing the old values).
I appreciate any help.
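One vectorised sketch, in case it helps: keep a lookup table of hour offsets keyed by state and adjust all rows in one step, with no per-state subsetting or for loop. The column names and the offsets below are assumptions, not a complete state-to-timezone mapping:

library(dplyr)

# toy data: times recorded in CST plus the state of each observation
obs <- tibble(
  state = c("NY", "TX", "CA"),
  time  = as.POSIXct(c("2023-01-01 10:00:00",
                       "2023-01-01 11:00:00",
                       "2023-01-01 12:00:00"), tz = "America/Chicago")
)

# hours to add to CST to reach each state's local time (illustrative)
offset_hours <- c(NY = 1, TX = 0, CA = -2)

obs <- obs %>%
  mutate(local_time = time + offset_hours[state] * 3600)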
I have the set of data below. It has a few rows of unwanted characters before the numbers I want to read in, as well as a few unwanted rows after the data. I created a substring to serve as my first column, which is purely numerical. When the set is read in, the data above and below these numeric rows is converted to NA. Is there a way, other than skip and nrow, to remove the NA rows and read in only the rows that are numerical?
x <- read.csv("...", header = FALSE, na.strings = "Y")
y <- substr(x$V1, 1, 8)
y <- as.numeric(y)
x2 <- cbind(y, x1)
x2 <- as.data.frame(x2)
I have tried:
if (x$y == is.numeric) {
  print(x)
} else {
  print("")
}
But that is clearly wrong as all I get are errors. I have been trying different combinations of the above code, as well as:
x3<-sapply(x$y,is.numeric)
x[x3,]
But nothing I try is working. I am either completely off or am missing something.
UPDATE: I was able to do this with both methods that were answered below.. but the problem now is, since the rows above the numeric rows contained characters, my columns are factors rather than numeric. Rather than actually deleting the rows, we were just temporarily removing them. Is there a way to permanently remove them so that my columns will be class numeric?
If this is just a case of removing rows containing NAs, have you tried complete.cases? Perhaps something like:
x2[complete.cases(x2),]
Also, it would be great if you could provide a minimal reproducible example.
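For the UPDATE: a minimal sketch of making the removal permanent so the surviving column is genuinely numeric, assuming (as in the original code) that the numbers sit in the first eight characters of V1:

# read everything as character so nothing becomes a factor
x <- read.csv("...", header = FALSE, na.strings = "Y",
              stringsAsFactors = FALSE)

# rows whose first 8 characters parse as a number are the keepers
y  <- suppressWarnings(as.numeric(substr(x$V1, 1, 8)))
x2 <- x[!is.na(y), , drop = FALSE]
x2$y <- y[!is.na(y)]          # numeric, not factor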
I am currently analysing a rather large dataset (22k+ records) and am having some trouble getting the data into a wide format (with one row corresponding to each observation, and columns representing variables).
The data came in two CSV files, one giving demographics and the other giving participants' probability ratings for a number of questions. Both of these CSV files were in long format.
I have used the reshape (and reshape2, for speed) packages to attempt to solve my problem. The specific issue I am having is the following.
I have the participants' probability ratings in the following form (after one successful reshape):
dtf <- read.csv("http://dl.dropbox.com/u/8566396/foobar.csv")
Now, the format I would like my data to be in is as follows:
UserID, Qid1, ..., Qid255, Time, with the probabilities for each question in that question's corresponding column.
I have tried a loop and apply to put the values into a new data frame, and many variations of melt and cast. I have also tried the base reshape function, but all to no avail.
In the past, I've always edited my CSV files directly, but this is not an option with the size of this file (my laziness when it comes to data manipulation within R has come back to haunt me).
Any advice or solution you can give to avoid me having to do this by hand would be greatly appreciated.
Your dataset has 6 rows, 3 of which have the column "variable" equal to "probability" and 3 of which have it equal to "time". You want probability to be the value of each cell, with time added on to the right.
I think there's a difficulty in making this work for you because what you want to do isn't clear. You have values for each UID-Time-X### cell, and values for each UID-Prob-X### cell. Therefore, you have to discard information to get it into your preferred format (UID-Time-X### with probabilities as the values). It seems to me like you're treating time as an ID variable, but it's storing values like a content variable.
To avoid discarding any data, your output would have to look something like:
UID Time1 Time2 Time3 Prob1 Prob2 Prob3
Which is simply reshaped wide.
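As a sketch of that wide reshape with reshape2, assuming the long data has columns UID, variable ("probability"/"time") and one column per question (X1, X2, X3 here):

library(reshape2)

# melt the question columns down, then cast UID against
# variable + question to get Prob* and Time* columns side by side
long <- melt(dtf, id.vars = c("UID", "variable"),
             variable.name = "question")
wide <- dcast(long, UID ~ variable + question)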