How does R know that I have no entries of a certain type?

I have a table where one of the variables is country of registration.
table(df$reg_country)
returns:
AR BR ES FR IT
123 202 578 642 263
Now, if I subset the original table to exclude one of the countries
df_subset<-subset(df, reg_country!='AR')
table(df_subset$reg_country)
returns:
AR BR ES FR IT
0 202 578 642 263
This second result is very surprising to me, as R seems to somehow magically know that I have removed the entries from AR.
Why does that happen?
Does it affect the size of the second data frame (df_subset)? If yes, is there a more efficient way to subset in order to minimize the size?

df$reg_country is a factor variable, which contains the information of all possible levels in the levels attribute. Check levels(df_subset$reg_country).
Factor levels only have a significant impact on data size if you have a huge number of them. I wouldn't expect that to be the case. However, you could use droplevels(df_subset$reg_country) to remove unused levels.
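As a minimal sketch of the behaviour described above (the toy data frame below is made up for illustration, not the asker's data), the unused level survives subsetting until droplevels() removes it:

```r
# Toy data: reg_country is a factor with five levels
df <- data.frame(reg_country = factor(c("AR", "BR", "BR", "ES", "FR", "IT")))

# Subsetting removes the AR rows but keeps the AR level
df_subset <- subset(df, reg_country != "AR")
levels_before <- levels(df_subset$reg_country)
counts_before <- table(df_subset$reg_country)   # AR is listed with a count of 0

# droplevels() discards levels that no longer occur in the data
df_subset$reg_country <- droplevels(df_subset$reg_country)
levels_after <- levels(df_subset$reg_country)
counts_after <- table(df_subset$reg_country)    # AR is no longer listed
```

Printing levels_before and levels_after shows the difference: the first still contains "AR", the second does not.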

Related

How can I regroup these two tables together

I want to combine these two R variables (votefr, voteaut) into one. How can I do it? In the end I want a single variable holding both the French and the Austrian vote (maybe votefraut).
(The tables for the two variables are below.)
table(d$votefr)
Centre-droite Extreme-droite Gauche
455 117 356
table(d$voteaut)
Centre-Droite Extreme-Droite Gauche
424 208 545
The result that I want is:
table(d$votefraut)
Centre-Droite Extreme-Droite Gauche
879 325 901
If you want to sum up all votes across different countries, you can try
Reduce(`+`, Map(table, df))
Otherwise, you can check table(df)
If both columns are factors with the same levels, we can just add them:
table(d$votefr) + table(d$voteaut)
If they are character class or factors with different levels, we need to convert them to factors with the same levels.
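A minimal sketch of that conversion, assuming the two columns differ only in the capitalisation of their labels (as the tables in the question suggest). The short vectors and the lower-cased level names are made up for illustration:

```r
# Toy stand-ins for d$votefr and d$voteaut (note the different capitalisation)
votefr  <- c("Centre-droite", "Gauche", "Extreme-droite", "Gauche")
voteaut <- c("Centre-Droite", "Extreme-Droite", "Gauche")

# Normalise the labels and give both factors the same levels
shared_levels <- c("centre-droite", "extreme-droite", "gauche")
fr_f  <- factor(tolower(votefr),  levels = shared_levels)
aut_f <- factor(tolower(voteaut), levels = shared_levels)

# With identical levels the two tables line up and can be added
combined <- table(fr_f) + table(aut_f)
```

Fixing the levels explicitly also keeps a category in the result when it happens to be absent from one of the two columns.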

Order a vector according to another vector

>nuevos<-(exam[411:510,1])
> [,1]
401 -0.325087210
402 0.576824342
403 0.314110438
404 -0.710141482
405 0.079179458
406 0.876819478
407 -0.563755647
408 -0.024573542
409 0.072860869
410 0.141759722
411 0.645346837
412 -0.178754696
413 -0.745086021
414 0.741761201
415 1.537360962
416 0.478560270
417 -0.721503050
418 -0.136435690
419 -0.264058207
420 1.851815905
421 0.854542022
422 0.055184071
423 0.214454147
424 -0.374941314
425 0.268580192
426 0.458531169
427 0.440158449
428 -1.539627467
429 -0.146698388
430 -0.174844929
This is my data; it's a matrix. The first column is the ID and the second column is the X value. I want to select 10 IDs: 5 with odd-numbered IDs and the other 5 with even-numbered IDs. The selection should be based on the X value (the most negative value is the best). I want to have something like this:
ID X
428 -1.539627467
413 -0.745086021
....
I tried to use sort(data[data%%2==1])[1:5], but I don't understand how to extract the ID column from the dataset. This is the result of a linear model, so R gives me the positions, but I want to work with both these positions and the X value. Please help me!
Thanks.
The numbers in the first "column" are the rownames of the matrix.
Since the objects in your question have differing names, it's not entirely clear to me whether the following will work as is.
So I would do something like this:
df=data.frame(ID=rownames(exam),X=exam[,1])
Otherwise please post the output of dput(exam) or dput(data)
Based on what I think you want to do, here's a working example, given the following data frame:
# generate random input data
data <- data.frame(ID=1:20, X=rnorm(20))
Tidyverse offers the cleanest solution:
require(tidyverse)
data %>%
arrange(X)
will sort in ascending order according to column X. Check the documentation for arrange for further details; you can do more complex things such as sorting by group or sorting on multiple columns (i.e., specify a first column and break ties based on successively sorted columns). So what I would recommend would be to put your data into a data frame first:
data <- data.frame(ID=rownames(nuevos), X=nuevos[,1])
where you could substitute ID with whatever you want and then do the above. Add a dput of nuevos for more specific feedback. Note that there are many ways to do this without tidyverse (sort, as you mentioned, for instance); tidyverse just tends to give the cleanest, simplest approach in my opinion, since it works well with many other useful tools such as ggplot and dplyr, and it is a good way of thinking to get accustomed to when working with data frames like this one.
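For the selection the question actually asks for (the five lowest X values among odd IDs and the five lowest among even IDs), here is a minimal base R sketch. It assumes the IDs are the row names of the matrix, as noted in the first answer. The random data and the helper pick_lowest are hypothetical:

```r
set.seed(1)
# Toy stand-in for nuevos: a one-column matrix whose row names are the IDs
nuevos <- matrix(rnorm(30), ncol = 1,
                 dimnames = list(as.character(401:430), NULL))

# Move the row names into a proper ID column
df <- data.frame(ID = as.integer(rownames(nuevos)), X = nuevos[, 1])

# Hypothetical helper: the n rows with the smallest (most negative) X
pick_lowest <- function(d, n = 5) head(d[order(d$X), ], n)

odd_ids  <- pick_lowest(df[df$ID %% 2 == 1, ])
even_ids <- pick_lowest(df[df$ID %% 2 == 0, ])

# Combine both groups and sort the final 10 rows by X
result <- rbind(odd_ids, even_ids)
result <- result[order(result$X), ]
```

The same grouping could be written with dplyr (group by ID %% 2, then take the lowest five per group); the base version is shown only to avoid an extra dependency.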

Rolling subset of data frame within for loop in R

Big picture explanation is I am trying to do a sliding window analysis on environmental data in R. I have PAR (photosynthetically active radiation) data for a select number of sequential dates (pre-determined based off other biological factors) for two years (2014 and 2015) with one value of PAR per day. See below the few first lines of the data frame (data frame name is "rollingpar").
par14 par15
1356.3242 1306.7725
NaN 1232.5637
1349.3519 505.4832
NaN 1350.4282
1344.9306 1344.6508
NaN 1277.9051
989.5620 NaN
I would like to create a loop (or any other way possible) to subset the data frame (both columns!) into two week windows (14 rows) from start to finish sliding from one window to the next by a week (7 rows). So the first window would include rows 1 to 14 and the second window would include rows 8 to 21 and so forth. After subsetting, the data needs to be flipped in structure (currently using the melt function in the reshape2 package) so that the values of the PAR data are in one column and the variable of par14 or par15 is in the other column. Then I need to get rid of the NaN data and finally perform a wilcox rank sum test on each window comparing PAR by the variable year (par14 or par15). Below is the code I wrote to prove the concept of what I wanted and for the first subsetted window it gives me exactly what I want.
library(reshape2)
par.sub=rollingpar[1:14, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
wilcox.test(value~variable, par.sub)
#when melt flips a data frame the columns become value and variable...
#for this case value holds the PAR data and variable holds the year
#information
When I tried to write a for loop to iterate the process through the whole data frame (total rows = 139) I got errors every which way I ran it. Additionally, this loop doesn't even take into account the sliding by one week aspect. I figured if I could just figure out how to get windows and run analysis via a loop first then I could try to parse through the sliding part. Basically I realize that what I explained I wanted and what I wrote this for loop to do are slightly different. The code below is sliding row by row or on a one day basis. I would greatly appreciate if the solution encompassed the sliding by a week aspect. I am fairly new to R and do not have extensive experience with for loops so I feel like there is probably an easy fix to make this work.
wilcoxvalues=data.frame(p.values=numeric(0))
Upar=rollingpar$par14
for (i in 1:length(Upar)){
par.sub=rollingpar[[i]:[i]+13, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
save.sub=wilcox.test(value~variable, par.sub)
for (j in 1:length(save.sub)){
wilcoxvalues$p.value[j]=save.sub$p.value
}
}
If anyone has a much better way to do this through a different package or function that I am unaware of I would love to be enlightened. I did try roll apply but ran into problems with finding a way to apply it to an entire data frame and not just one column. I have searched for assistance from the many other questions regarding subsetting, for loops, and rolling analysis, but can't quite seem to find exactly what I need. Any help would be appreciated to a frustrated grad student :) and if I did not provide enough information please let me know.
Consider an lapply over a sequence of window start rows taken every 7 values through the 365 days of the year (the last day is left out to avoid a single-day final group). Each call returns a data frame holding the Wilcoxon test p-value and a week indicator; afterwards, row-bind the list items into a single final data frame:
library(reshape2)
slidingWindow <- seq(1,364,by=7)
slidingWindow
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127
# [20] 134 141 148 155 162 169 176 183 190 197 204 211 218 225 232 239 246 253 260
# [39] 267 274 281 288 295 302 309 316 323 330 337 344 351 358
# LIST OF WILCOX P VALUES DFs FOR EACH SLIDING WINDOW (TWO-WEEK PERIODS)
wilcoxvalues <- lapply(slidingWindow, function(i) {
par.sub=rollingpar[i:(i+13), ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
data.frame(week=paste0("Week: ", i%/%7+1, "-", i%/%7+2),
p.values=wilcox.test(value~variable, par.sub)$p.value)
})
# SINGLE DF OF ALL P-VALUES
wilcoxdf <- do.call(rbind, wilcoxvalues)
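Since the question mentions a data frame of 139 rows rather than a full year, a variant that derives the window starts from nrow(rollingpar) may be safer, because no window then runs past the end of the data. This is only a sketch: the synthetic values are made up, and base R's stack() is swapped in for reshape2::melt to keep the example free of dependencies:

```r
set.seed(42)
# Synthetic stand-in for rollingpar: 139 daily PAR values per year
rollingpar <- data.frame(par14 = rnorm(139, mean = 1300, sd = 100),
                         par15 = rnorm(139, mean = 1250, sd = 100))
rollingpar$par14[seq(2, 139, by = 7)] <- NaN   # some missing days

# Start rows of each 14-row window, sliding by 7 rows; only full windows
starts <- seq(1, nrow(rollingpar) - 13, by = 7)

p_values <- vapply(starts, function(i) {
  win <- stack(rollingpar[i:(i + 13), ])   # long format: columns values, ind
  win <- na.omit(win)                      # drops NaN rows as well
  wilcox.test(values ~ ind, data = win)$p.value
}, numeric(1))

wilcoxdf <- data.frame(first_row = starts,
                       last_row  = starts + 13,
                       p.value   = p_values)
```

With 139 rows this gives 18 overlapping windows; the last few rows that cannot fill a complete window are left out.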

summary() of transactions is wrong for itemMatrix object

I am trying to do some market basket analysis using the arules package, but when I use the summary() function on an itemMatrix object to check which are the most frequent items, the numbers do not add up.
If I do:
library(arules)
x <- read.transactions("Supermarket2014-15.csv")
summary(x)
I get:
transactions as itemMatrix in sparse format with
5001 rows (elements/itemsets/transactions) and
997 columns (items) and a density of 0.003557162
most frequent items:
45 28 42 35 22 (Other)
503 462 444 440 413 15474
But if I check with a for loop, or even in Excel, the count for the product 45 is 513 and not 503. The same for 28, which should be 499, and so on.
The odd thing is if I sum up all the totals (15474+413+440+444+462+503) I get the correct number for the total of transacted products.
The data has several NA values and products are factors.
And here is the raw data (Day ranges from 1 to 28, Product ranges from 1 to 50):
If you look at the result of your str(x) call then you see under #iteminfo and $labels that some items have labels like "1;1", etc. This means that the items are not correctly separated after reading the file in. The default separator in read.transactions() is a white space, but you seem to have (some) semicolons there. Try sep=";" in read.transactions().
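To see why a wrong separator lowers the counts, here is a toy illustration in base R only. It does not call arules; strsplit() merely mimics what happens when basket lines are split on the wrong character, and the four lines are made up:

```r
# Four toy baskets, items separated by semicolons
baskets <- c("45", "45;28", "28;45", "45;42")

# Splitting on whitespace leaves fused labels such as "45;28"
items_ws <- unlist(strsplit(baskets, "[[:space:]]+"))
count_ws <- sum(items_ws == "45")     # only the basket containing 45 alone

# Splitting on the real separator recovers every occurrence
items_semi <- unlist(strsplit(baskets, ";", fixed = TRUE))
count_semi <- sum(items_semi == "45")
```

Here count_ws is smaller than count_semi, and the total number of parsed labels shrinks as well, because each fused label is counted as a single distinct item.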

R: Plots of subset still include excluded attributes, how do I draw a plot without them?

I am trying to draw a boxplot in R:
I have a dataset with 70 attributes:
The format is
patient number medical_speciality number_of_procedures
111 Ortho 21
232 Emergency 16
878 Pediatrics 20
981 OBGYN 31
232 Care of Elderly 15
211 Ortho 32
238 Care of Elderly 11
219 Care of Elderly 6
189 Emergency 67
323 Emergency 23
189 Pediatrics 1
289 Ortho 34
I have been trying to get a subset to only include emergency, pediatrics in a boxplot (there are 10000+ datapoints in reality)
I thought that I could just do this:
newdata<-subset(olddata[ms$medical_specialty=='emergency'|olddata$medical_specialty=='pediatrics',])
plot(newdata)
Since if I do a summary of newdata, all it has is the pediatrics and emergency results. But when it comes to plotting it still includes the ortho, OBGYN, care of elderly in the x axis with no boxplot.
I presume that there is a way to do this in ggplot by doing
ggplot(newdata, aes(x=medical_speciality, y=num_of_procedures, fill=cond)) + geom_boxplot()
but this gives me the error:
Don't know how to automatically pick scale for object of type data.frame.
Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the dataProblems:cond
Can someone help me out?
I believe your problem comes from the fact that the column medical_speciality is a factor.
So, even though you subset your data the right way, you still get all the levels (including "Ortho", "OBGYN", etc...).
You can get rid of them by using the function droplevels:
newdata <- subset(olddata, medical_specialty %in% c('emergency', 'pediatrics'))
newdata <- droplevels(newdata) ## THIS IS THE NEW ADDITION
plot(newdata)
Does this help?
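A self-contained sketch of the same idea, using a few made-up rows modelled on the table in the question. The column names and the capitalised labels are assumptions; adjust them to the real data:

```r
# Toy data modelled on the question's table
olddata <- data.frame(
  medical_specialty    = factor(c("Ortho", "Emergency", "Pediatrics",
                                  "OBGYN", "Emergency", "Pediatrics")),
  number_of_procedures = c(21, 16, 20, 31, 67, 1)
)

# Keep two specialties, then drop the levels that are no longer used
keep    <- c("Emergency", "Pediatrics")
newdata <- droplevels(subset(olddata, medical_specialty %in% keep))

# Only the remaining levels appear on the x axis
boxplot(number_of_procedures ~ medical_specialty, data = newdata)
```

As for the ggplot error quoted in the question, it names cond as the problem, which suggests that fill=cond refers to something that is not a column of newdata.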
