This is the code I used to look at a subset of data:
active <- clinic[
  (clinic$Days.since.injury.physio > 20 & clinic$Days.since.injury.physio < 35) &
  (clinic$Days.since.injury.F.U.1 > 27 & clinic$Days.since.injury.F.U.1 < 63),
]
I'd like to select a group of subjects based on two criteria and then analyze their results. While looking at the descriptive statistics, I noticed NA counts that exceeded the size of the entire data set.
Subsetting seems to have introduced NAs. I've looked at several posts, including the two below, which seem relevant, but I don't understand how to apply the answers.
Why does subsetting cause NAs that don't exist in the full data set? (I think the answer from other posts is that there is an NA in another variable?)
How do I work around this?
I'd like to get values from the variables that are present rather than ignoring the whole row if there is a missing value.
Thank you.
Subsetting R data frame results in mysterious NA rows
NA when trying to summarize a subset of data (R)
This is a workaround, in response to your second question. Looking at your code, there is a much easier way of subsetting data. Try this and check whether it solves your issue:
library(dplyr)

active <- clinic %>%
  filter(Days.since.injury.physio > 20,
         Days.since.injury.physio < 35,
         Days.since.injury.F.U.1 > 27,
         Days.since.injury.F.U.1 < 63)
dplyr does wonders when it comes to subsetting and manipulating data, and it addresses your problem directly: filter() drops rows where a condition evaluates to NA, whereas [ with a logical index returns a row of NAs for them. The %>% operator pipes the result of one expression into the next, so you rarely need the $ operator. If, for some bizarre reason, you don't like this, look at the subset() function in base R, which also drops rows where the condition is NA.
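To see where the NA rows come from, here is a minimal sketch with a made-up data frame (df and its columns are hypothetical, just to reproduce the behaviour):

df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

# `[` keeps rows where the condition is TRUE, but a row whose condition
# evaluates to NA comes back as a row of NAs:
df[df$x > 1, ]

# subset() and dplyr::filter() treat an NA condition as FALSE and drop the row:
subset(df, x > 1)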
I'm quite new to R and currently stuck on the following task. I have spatial data in the following format:
   lat              long
1  49,6837508756316 8,97846155698244
2  49,9917393661473 8,2382869720459
3  51,308416699361  12,4118696787101
4  50,7048668720388 6,62725165486336
...
and so on. It's a pretty large data set.
I've been advised to convert my data set to sf data to work with it properly. Can somebody help me with that? I think one problem might also be that the decimal mark is a comma.
Thanks for your help guys!
I assume the data is in a data.frame called sf:
sf <- data.frame(lat = c("49,6837508756316", "49,9917393661473",
                         "51,308416699361", "50,7048668720388"),
                 long = c("8,97846155698244", "8,2382869720459",
                          "12,4118696787101", "6,62725165486336"),
                 stringsAsFactors = FALSE)
The problem is that the entries are characters, so you have to convert them to numeric. This can be done via as.numeric, but that function expects the decimals to be separated by a dot (.), so you have to convert the commas to dots before calling as.numeric. The conversion can be done with gsub.
sf$lat  <- as.numeric(gsub(",", ".", sf$lat))
sf$long <- as.numeric(gsub(",", ".", sf$long))
If you have many columns and you don't want to copy and paste the above for every column, I would suggest:
sf[] <- lapply(sf, function(colValues) as.numeric(gsub(",", ".", colValues)))
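Since the goal is an sf object: once the columns are numeric, a minimal sketch using the sf package would be the following, assuming the coordinates are WGS84 longitude/latitude (EPSG:4326):

library(sf)

# Build point geometries from the numeric long/lat columns.
points_sf <- st_as_sf(sf, coords = c("long", "lat"), crs = 4326)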
First off, I have looked at all the examples and previous questions and have not been able to find a usable answer to my situation.
I have a data set of 300-ish independent variables I'm trying to bring into R. The variables are all classified as factors. In the CSV file I'm uploading, all of the variables are pricing data with two decimal places. I have used the following code, and some of the variables have been converted with their decimals intact. However, many of the converted columns are filled with NAs; in fact, some columns are entirely NA.
dsl$price = as.numeric(as.factor(dsl$price))    # <- this completely changes the data into something unrecognizable
dsl$price = as.numeric(as.character(dsl$price)) # <- lots of NAs, or entirely NAs
I've tried to recode the variables in the original CSV file to numeric, but with no luck.
Convert the factor to character, which can then be converted to numeric:
dsl$price <- as.numeric(as.character(dsl$price))
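If that still produces NAs, the underlying strings most likely contain characters that as.numeric cannot parse, such as currency symbols or thousands separators. A minimal sketch of cleaning those out first (the "[$,]" pattern is an assumption about what the pricing data might contain; adjust it to your file):

raw <- as.character(dsl$price)

# Strip a currency symbol and thousands separators before converting.
dsl$price <- as.numeric(gsub("[$,]", "", raw))

# Show the original strings that still failed to parse.
unique(raw[is.na(dsl$price) & !is.na(raw)])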
The dataset I am working on looks like DATA; there are 6 different countries, and r_1..r_13 specify the reasons. I want to apply PCA to this dataset to find the significant reasons for each country.
The question I want to ask is: how can I run PCA for each country without reading a separate file for each country? I want to read the entire file as shown above.
Also, please check the code I am using for the PCA:
pca <- prcomp(numeric, center = TRUE, scale. = TRUE)
summary(pca)

# Eigenvalues are the squared standard deviations of the components.
eigen_val <- pca$sdev^2
sum(eigen_val)

# Proportion of variance explained by each component.
prop_var <- round(eigen_val / sum(eigen_val), 4)
round(sum(prop_var[1:13]), 4)

# Loadings (the rotation matrix).
load <- pca$rotation
After computing the rotation matrix, I will check which PCs are most correlated with which observed variables and decide the significance of the variables accordingly (the more PCs a variable is correlated with, the more significant the variable).
Kindly suggest whether the approach is correct or not !
Thanks!!
Here's a simple starting point for a solution that you can tweak to get the results in your desired format. Let's assume you're working with the iris dataset in R and you want to run PCA for each Species, much like you want to run PCA for each country in your data.
data(iris)

# Split the data frame into one piece per species (one per country in your case).
Iris <- split(iris, iris$Species)

# Run prcomp() on the numeric columns of each piece; split() and prcomp()
# are base R, so no extra packages are needed.
for (i in 1:length(Iris)) {
  assign(paste0("pca", i),
         prcomp(Iris[[i]][names(iris) != "Species"], center = TRUE, scale. = TRUE))
}
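A variant that keeps the results in one named list instead of separate pca1, pca2, ... objects (a base-R sketch; swap Species for your country column):

pcas <- lapply(split(iris, iris$Species), function(d) {
  prcomp(d[names(d) != "Species"], center = TRUE, scale. = TRUE)
})
summary(pcas$setosa)  # results for one group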
I have a database which has "null" values for some elements. I wanted to replace these "null" values, or even delete the rows containing them, but I could not find a way to do so. I used is.na to return these rows, but it seems that NA and NULL are two different concepts in R. Does anybody know how I can do this?
Thanks
"null" might be coming in the class character, as in "null" is not the same thing as NULL in R, so is.null() will not work if it is treating it as a character sequence.
You should be able to find these values pretty easily in your dataset using a which statement:
# arr.ind = TRUE returns a row/column index matrix; take the unique row numbers.
todelete <- unique(which(dataset == "null", arr.ind = TRUE)[, 1])
newdataset <- dataset[-todelete, ]
Or you can give us a snapshot of what your dataset looks like, using str(dataset), and we can help diagnose better.
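Since you also ask about replacing rather than deleting, a minimal sketch that turns the "null" strings into real NA values, after which is.na() behaves as expected:

# Logical-matrix assignment works elementwise across the whole data frame.
dataset[dataset == "null"] <- NA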
So here's my problem:
I have a bunch of data about sound production and where the emphasis falls in a word. What I'm trying to do is determine whether the difference in production between stressed and unstressed syllables is significant. The problem is that when I try to use the cor() function, the data sets aren't the same length: I have about 500 instances of stressed syllables, but only 400 of unstressed syllables. I'm very new to R, but here's the code I've attempted:
data <- read.csv('D:/blaaah/Stressed.csv', header=TRUE)
var1 <- data$intdiff
data <- read.csv('D:/blaaah/Unstressed.csv', header=TRUE)
var2 <- data$intdiff
cor(var1, var2)
Of course, I get an error because the data sets are different lengths. So how do I check for significance between the sets without having them be the same length?
Thanks a bunch!
P.S. Just ask if my question isn't clear. I'm afraid I sometimes assume everyone knows what I'm doing...
Using cor() would be appropriate if you expected there to be a relationship between var1 and var2, for instance if you'd expect the value of an item in var2 to be larger if the corresponding item in var1 is larger. There is a difficulty when the data sets are not the same length, because there are no corresponding items to compare once you get past the end of the shorter dataset.
I think, in this case, a comparison of the two data sets to establish whether their means differ is more likely to be useful to you. For that, you'd want a t test, as described, with examples in R, here. You'd also want to confirm that the assumptions of the t test hold in your case, e.g. see here.
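A minimal sketch of that comparison, reusing var1 and var2 from the question (R's t.test() runs Welch's t test by default, which does not require equal group sizes or equal variances):

# Two-sample Welch t test; the vectors may have different lengths.
t.test(var1, var2)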