I have a matrix of 1's and 0's. I want to replace all the 1's by an identifier code for that row (the identifier code is given in row 2).
I tried:
dax[,2:109] <- replace(dax[,2:109],dax[,2:109]==1,dax[2,])
but this isn't working right. I've tried to set up a loop, but I've had no success so far.
I'm new to R. Any help is appreciated.
This may do it for you, although it'd be nice to get more details from you.
for(j in 2:109) dax[dax[,j]==1,j] <- dax[2,j]
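If dax really is a matrix, a loop-free sketch along the same lines should also work (this just reuses the 2:109 column range and the row-2 codes from the question, so treat the layout as an assumption):
ones <- which(dax[, 2:109] == 1, arr.ind = TRUE)    # row/column positions of the 1's
dax[, 2:109][ones] <- dax[2, 2:109][ones[, "col"]]  # pick the row-2 code for each 1's column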
I want to have "Blancas" and "Sultana" under the "Variete" column.
Why, after I use subset(), is the filtered data less than it should be?
Figure 1 is the original data,
figure 2 is the expected result,
and figure 3 is the result I obtained with the code below:
library(readxl)   # for read_excel()
df <- read_excel("R_NLE_FTSW.xlsx")
options(scipen = 200)
BLANCAS <- subset(df, Variete == c("Blancas", "Sultana"))
view(BLANCAS)
It's obvious that some of the BLANCAS data are missing.
P.S. If I try it on a sub-sheet, the final result is sometimes 5 times larger!
path = "R_NLE_FTSW.xlsx"
df <- map_dfr(excel_sheets(path),
~ read_xlsx(path, sheet = 4))
I don't understand why sometimes it's more and sometimes less than the expected result. Can anyone help me? Thank you so much!
First of all, while you mention that you need both "Blancas" and "Sultana", your expected result shows only Blancas! So get that straight first.
For such data coming from Excel:
Always clean the data after it's imported. Check the unique values to see whether there are any extra spaces, etc.
Trim the character data, and ensure date fields are correct and numbers are numeric (not character).
Now, to subset the data, use df %>% filter(Variete %in% c('Blancas','Sultana'))
-> you can modify the c() vector to include the items of interest.
-> if you wish to clean on the go:
df %>% filter(trimws(Variete) %in% c('Blancas','Sultana'))
And for your sub-sheet problem: we don't even know what data is in there. If it's similar, then apply the same logic.
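To see why == drops rows while %in% keeps them all: == recycles the two-element comparison vector down the column, so each row is only compared against one of the two names in alternation. A minimal sketch with toy data standing in for the real df:
library(dplyr)
toy <- data.frame(Variete = c("Blancas", "Blancas", "Sultana", "Sultana"))
subset(toy, Variete == c("Blancas", "Sultana"))        # keeps only rows 1 and 4
toy %>% filter(Variete %in% c("Blancas", "Sultana"))   # keeps all four rows
As for the sub-sheet result coming out about 5 times larger: the map_dfr() call in the question maps over every sheet name but always reads sheet = 4, so it most likely stacks one copy of sheet 4 per sheet in the workbook; ~ read_xlsx(path, sheet = .x) would read each sheet once.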
Hear me out. Consider an arbitrary case where the new column's elements do not require any information from other columns (which frustrates base $ and mutate assignment), and not every element in the new column is the same. Here is what I've tried:
df$rand<-rep(sample(1:100,1),nrow(df))
unique(df$rand)
[1] 58
and rest assured, nrow(df)>1. I think the correct solution might have to do with an apply function?
Your code repeats one single random number nrow(df) times. Try instead:
df$rand<-sample(1:100, nrow(df))
This samples without replacement from 1:100 nrow(df) times. Now this would give you an error if nrow(df)>100 because you would run out of numbers from 1:100 to sample. To make sure you don't get this error, you can instead sample with replacement:
df$rand<-sample(1:100, nrow(df), replace = TRUE)
If, however, you don't want any random numbers to repeat but would also like to prevent the error, you can do something like this:
df$rand<-sample(1:nrow(df), nrow(df))
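For a quick sanity check, a toy stand-in for df shows the difference from the original rep(sample(1:100, 1), nrow(df)) call:
set.seed(42)                                   # only to make the example reproducible
df <- data.frame(x = letters[1:5])
df$rand <- sample(1:100, nrow(df), replace = TRUE)
unique(df$rand)                                # several distinct values, not one repeated number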
If I understand this correctly, I think this is pretty easily doable in dplyr or data.table.
For example, a dplyr solution on iris:
iris %>% mutate(sample(n()))
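A slightly more complete version of that suggestion, naming the new column and loading dplyr (just a sketch, not the only way to write it):
library(dplyr)
iris %>% mutate(rand = sample(n()))   # one draw per row from 1:n(), without replacement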
Not sure if the title is clear enough. I have the following dataframe: (ST.final is the name of the df)
n;date;ws;wd
1;2011-11-01 00:00:00;7,15;113,7
2;2011-11-01 00:10:00;7,25;115,7
3;2011-11-01 00:20:00;NA;NA
4;2011-11-01 00:30:00;NA;NA
5;2011-11-01 00:40:00;7,2;100,7
6;2011-11-01 00:50:00;6,95;104,7
And I want to create a new one with the rows containing NAs plus the upper and lower limit rows. The result should be something like this:
n;date;ws;wd
2;2011-11-01 00:10:00;7,25;115,7
3;2011-11-01 00:20:00;NA;NA
4;2011-11-01 00:30:00;NA;NA
5;2011-11-01 00:40:00;7,2;100,7
Maybe I am missing something but I have no clue on how to perform this task. So far I am trying to use this
interp.df <- ST.final[(is.na(ST.final$ws)),]
and as expected it just copies every row containing NAs. I searched for a solution on Google but couldn't find anything similar.
Any help is appreciated.
You could try
idx <- which(!complete.cases(ST.final))
idx <- sort(unique(c(idx-1, idx, idx+1)))
ST.final[idx, ]
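One caveat: if the last row contains NAs, idx + 1 points past the end of the data frame and ST.final[idx, ] would then include a spurious all-NA row (a 0 from idx - 1 is silently dropped, so the start is fine). A guarded version of the same idea, reusing interp.df from the question:
idx <- which(!complete.cases(ST.final))
idx <- sort(unique(c(idx - 1, idx, idx + 1)))
idx <- idx[idx >= 1 & idx <= nrow(ST.final)]   # keep only indices that actually exist
interp.df <- ST.final[idx, ]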
So I know this has been asked before, but from what I've searched I can't really find an answer to my problem. I should also add I'm relatively new to R (and any type of coding at all) so when it comes to fixing problems in code I'm not too sure what I'm looking for.
My code is:
education_ge <- data.frame(matrix(ncol=2, nrow=1))
colnames(education_ge) <- c("Education","Genetic.Engineering")
for (i in 1:nrow(survey))
if (survey[i,12]=="Bachelors")
education_ge$Education <- survey[i,12]
To give more info, 'survey' is a data frame with 12 columns and 26 rows, and the 12th column, 'Education', is a factor which has levels such as 'Bachelors', 'Masters', 'Doctorate' etc.
This is the error as it appears in R:
for (i in 1:nrow(survey))
if (survey[i,12]=="Bachelors")
education_ge$Education <- survey[i,12]
Error in if (survey[i, 12] == "Bachelors") education_ge$Education <- survey[i, :
missing value where TRUE/FALSE needed
Any help would be greatly appreciated!
The error means that survey[i, 12] is NA for some i, so the if() condition evaluates to NA instead of TRUE or FALSE. If you just want to ignore any records with missing values and get on with your analysis, try inserting this at the beginning:
survey <- survey[complete.cases(survey), ]
It basically finds the indexes of all the rows with no NAs anywhere, and then subsets survey to keep only those rows.
For more information on subsetting, try reading this chapter: http://adv-r.had.co.nz/Subsetting.html
The command:
sapply(survey,function (x) sum(is.na(x)))
will show you how many NAs you have in each column. That might help your data cleaning.
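If you would rather keep the loop itself, the immediate error can also be avoided by making the comparison NA-safe. This sketch only addresses that one error; the assignment inside still overwrites the same cell on every match, exactly as in the original code:
for (i in seq_len(nrow(survey))) {
  if (isTRUE(survey[i, 12] == "Bachelors")) {   # isTRUE() treats an NA comparison as FALSE
    education_ge$Education <- survey[i, 12]
  }
}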
You can try this:
sub<-subset(survey,survey$Education=="Bachelors")
education_ge$Education<-sub$Education
Let me know if this helps.
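One caveat with that assignment: education_ge was created with a single row, so if more than one record has Education == "Bachelors", education_ge$Education <- sub$Education will fail with a length mismatch. Building the result directly from the subset avoids that (a sketch; Genetic.Engineering is left as NA because the question does not say how it is filled):
sub <- subset(survey, Education == "Bachelors")
education_ge <- data.frame(Education = sub$Education,
                           Genetic.Engineering = NA)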
I am confused by the behavior of is.na() in a for loop in R.
I am trying to make a function that will create a sequence of numbers, do something to a matrix, summarize the resulting matrix based on the sequence of numbers, then modify the sequence of numbers based on the summary and repeat. I made a simple version of my function because I think it still gets at my problem.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq))
##generate a table where the row names are those numbers
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10))
##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <-
count(temp.results[,1])$freq
##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
## the idea would be to keep cutting this sequence of numbers down with
## successive iterations until the desired number of iterations per row in
## details.table was reached. in other words, in the real code i'd do
## something to details.table in the next line
print(rich.seq)
}
}
##call the function
test(desired.iterations=4, max.iterations=2)
On the first run through the for loop the rich.seq looks like I'd expect it to, where 5 & 6 are no longer in the sequence because both ended up with more than 4 iterations. However, on the second run, it spits out something unexpected.
UPDATE
Thanks for your help and also my apologies. After re-reading my original post it is not only less than clear, but I hadn't realized count was part of the plyr package, which I call in my full function but wasn't calling here. I'll try and explain better.
What I have working at the moment is a function that takes a matrix, randomizes it (in any of a number of different ways), then calculates some statistics on it. These stats are temporarily stored in a table--temp.results--where temp.results[,1] is the sum of the non zero elements in each column, and temp.results[,2] is a different summary statistic for that column. I save these results to a csv file (and append them to the same file at subsequent iterations), because looping through it and rbinding hogs a lot of memory.
The problem is that certain column sums (temp.results[,1]) are sampled very infrequently. In order to sample those sufficiently requires many many iterations, and the resulting .csv files would stretch into the hundreds of gigabytes.
What I want to do is create and then update a table (details.table) at each iteration that keeps track of how many times each column sum actually got sampled. When a given element in the table reaches the desired.iterations, I want it to be excluded from the vector rich.seq, so that only columns that haven't received the desired.iterations are actually saved to the csv file. The max.iterations argument will be used in a break() statement in case things are taking too long.
So, what I was expecting in the example case is the exact same line for rich.seq for both iterations, since I didn't actually do anything to change it. I believe that flodel is definitely right that my problem lies in comparing a matrix (details.table) of length longer than rich.seq, leading to unexpected results. However, I don't want the dimensions of details.table to change. Perhaps I can solve the problem implementing %in% somehow when I redefine rich.seq in the for loop?
I agree you should improve your question. However, I think I can spot what is going wrong.
You compute details.table before the for loop. It is a matrix with the same length as rich.seq had when it was first initialized (length(4:34), i.e. 31).
Inside the for loop, details.table < desired.iterations | is.na(details.table) is then a logical vector of length 31. On the first loop iteration,
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
will result in reducing the length of rich.seq. But on the second loop iteration, unless details.table is redefined (not the case), you are trying to subset rich.seq by a logical vector of longer length than rich.seq. This will certainly lead to unexpected results.
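A tiny illustration of that kind of mismatch, using a three-element vector subset by a five-element logical index (nothing to do with the question's data, just the mechanism):
x <- 4:6
x[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
[1]  4  6 NA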
You probably meant to redefine details.table as part of your for loop.
(Also I am surprised to see you never used temp.results[,2].)
Thanks to flodel for setting me off on the right track. It had nothing to do with is.na but rather the lengths of vectors I was comparing.
That said, I set the initial values of the details.table to zero to avoid the added complexity of the is.na statement.
This code works, and can be modified to do what I described above.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq)) ##generate a table where the row names are those numbers
details.table[,1] <- 0
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10)) ##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <- count(temp.results[,1])$freq ##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- row.names(details.table)[details.table[,1] < desired.iterations]
print(rich.seq)
}
}
Rather than trying to cut rich.seq down, I just redefine it on every iteration based on whatever happened to details.table during the previous iteration.
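One small caveat with that redefinition: row.names() returns a character vector, so rich.seq becomes "4", "5", ... rather than numbers. If the full function uses rich.seq numerically (an assumption about code not shown here), convert it back, e.g.:
rich.seq <- as.numeric(row.names(details.table)[details.table[, 1] < desired.iterations])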