Subset is not working - r

I have a dataset called "x" that produces the following records when I do head(x, 1)
VRTG_ID_NR EEG_VRTG_CAT_V GEBR_IDENT_KEUR PL_KEURING NAAM_VRTG_AANB DAT_RESULT_KEUR TYD_RESULT_KEUR KL_CODE_EU_1 KL_CODE_EU_2
1 VF1JA04N522215749 M1 NULL NULL NULL 20090527 906 6 NULL
RES_CODE_KEUR KENT_LAND_OORS LAND_HERK YEAR
1 GDK ME-QT 761 D 2009
I get the following when I show the classes of the relevant column
$YEAR
[1] "numeric"
I now want to create a subset where I only see data from the years 2009 and 2010. So I tried
x_subset <- x[x$YEAR >=2009 & <= 2011]
That, however, gives me the following output instead of the rows I expected:
data frame with 0 columns and 992287 rows
While actually I want an overview with a subset of the records between 2009 and 2011...

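If YEAR is a factor variable, first convert it to numeric. Go through as.character, because as.numeric on a factor returns the internal level codes rather than the years:
x$YEAR <- as.numeric(as.character(x$YEAR))
I think you are missing a comma, and the second comparison needs its own x$YEAR. Without the comma, the logical vector selects columns instead of rows:
x_subset <- x[x$YEAR >= 2009 & x$YEAR <= 2011, ]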
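Both points can be tried on a small example. This is only a sketch: the data frame below is a made-up stand-in for the real x, and its values are illustrative.

```r
# Toy stand-in for x; YEAR is deliberately stored as a factor.
x <- data.frame(
  VRTG_ID_NR = c("a", "b", "c", "d"),
  YEAR = factor(c(2008, 2009, 2011, 2012))
)

# as.numeric() on a factor returns the level codes, not the years.
codes <- as.numeric(x$YEAR)

# Going through as.character() recovers the actual year values.
x$YEAR <- as.numeric(as.character(x$YEAR))

# The comma after the condition makes this a row selection.
x_subset <- x[x$YEAR >= 2009 & x$YEAR <= 2011, ]
print(x_subset)
```

If YEAR is already numeric, as class(x$YEAR) suggests in the question, the conversion step is harmless and only the last line matters.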

Related

How to extract data from the dataset with a certain condition and how to combine data from two columns into one in R

This is my dataset example for one person:
I have made this table:
deathYear diagYear fcediags pid
2013      NA       I21      1
2011      NA       I63      2
2033      NA       I21      4
2029      NA       I25      5
2020      NA       I21      18
2012      NA       I63      19
The problem is with diagYear above: every value is NA.
Also:
Table T2 should only contain rows for persons who have at least one of the diagnoses "I20", "I21", "I22", "I25" or "I63" in the document data (regardless of whether document$FCEs$alive is TRUE or FALSE). For those persons, and only for them, it should also show the year of death (extracted from the date as in the deathYear code below), even if the person died from some other diagnosis.
I also need a single column Year instead of the two columns deathYear and diagYear. It should contain the year extracted from document$FCEs$date (please see the picture), under these conditions:
1. If document$FCEs$alive is TRUE, Year should be filled only if at least one Diag1 in the person's document set is "I20", "I21", "I22", "I25" or "I63".
2. If document$FCEs$alive is FALSE (and only for the persons satisfying condition 1), Year should hold deathYear from the code below, whatever the Diag1 value at the time of death (it does not have to be "I20", "I21", "I22", "I25" or "I63").
I have tried these codes:
getDiags <- function(x) {
  document <- fromJSON(x)
  fcediags <- document$FCEs$Diag1
  fcedage <- document$FCEs$pAge
  fcealive <- document$FCEs$alive
  deathYear <- 2030
  if (length(strsplit(document$FCEs[document$FCEs$alive == FALSE, ]$date, "/")) > 0)
    deathYear <- as.numeric(strsplit(document$FCEs[document$FCEs$alive == FALSE, ]$date, "/")[[1]][1])
  diagYear <- 0
  v1 = c("I20", "I21", "I22", "I25", "I63")
  for (i in 1:length(document$FCEs$Diag1)) {
    if (document$FCEs$Diag1[i] %in% v1) {
      diagYear <- as.numeric(strsplit(document$FCEs[document$FCEs$Diag1[i], ]$date, "/")[[1]][1])
    }
  } # this block of code doesn't work, it shows NA in the table
  return(data.frame(fcedage, fcediags, fcealive, sex, ldl, pid = document$ID, deathYear, diagYear))
}
for (i in 1:length(fces$fcediags)) {
  T2 <- subset(fces, fces$fcediags == "I20" | fces$fcediags == "I21" | fces$fcediags == "I22" | fces$fcediags == "I25" | fces$fcediags == "I63", select = c(deathYear, diagYear, fcediags, pid))
}
# This table is obviously wrong: it only shows rows whose Diag1 is one of "I20", "I21", ..., "I63". For the persons with those Diag1 values it should also show the year of death (document$FCEs$alive == FALSE), whatever the Diag1 value at the time of death.
(pid is the person's ID.) These attempts are not good enough: the values in diagYear should not be NA, and the two year columns should be merged into one.
Can someone please help me? Thank you in advance!
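One likely cause of the NA values is the row index in the loop: document$FCEs[document$FCEs$Diag1[i], ] indexes rows by a diagnosis code, not by position. The sketch below illustrates this on a made-up stand-in for document$FCEs. The data values, the helper year_of, and the rule "year of death takes precedence" are assumptions for illustration; dates are assumed to be year-first and "/"-separated, as the strsplit calls in the question imply. Unlike the loop in the question, which keeps the last matching diagnosis, this sketch takes the first.

```r
# Hypothetical stand-in for document$FCEs of one person (values are made up).
fces <- data.frame(
  Diag1 = c("J18", "I21", "C34"),
  date  = c("2010/03/01", "2011/06/15", "2013/01/20"),
  alive = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
v1 <- c("I20", "I21", "I22", "I25", "I63")

# Same year extraction as in the question: first field of a "/"-separated date.
year_of <- function(d) as.numeric(strsplit(d, "/")[[1]][1])

# Indexing rows by a diagnosis code searches the row names ("1", "2", "3"),
# finds no match, and yields a row of NAs.
bad_date <- fces[fces$Diag1[2], ]$date

# Index with logical vectors instead.
hit  <- fces$Diag1 %in% v1
dead <- !fces$alive
diagYear  <- if (any(hit))  year_of(fces$date[hit][1])  else NA
deathYear <- if (any(dead)) year_of(fces$date[dead][1]) else NA

# One Year column: only for persons with a qualifying diagnosis;
# the year of death takes precedence when the person has died.
Year <- if (!any(hit)) NA else if (!is.na(deathYear)) deathYear else diagYear
```

Inside getDiags, the same idea would amount to reading document$FCEs$date[i] in the loop rather than indexing the data frame with Diag1[i].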

R programming; double indexing loop to find mean of subsetted data frame

I would like to run a "for" loop that uses three indices. Basically, I want to subset a data frame, find the mean of the subset, and place the mean value in a new data frame. I am having trouble running this loop; all I get are NaNs.
The first index matches the rows of the new data frame (which I call data.avg);
the second index selects from a vector used in the first half of the subsetting condition (that the date falls in a specific month);
the third index does the same for the second half of the subsetting condition (that the row is a Breakfast/Dinner/Snacks serving).
# Create the data frame
data1 = data.frame(date = sort(rep(as.Date(42948:43101, origin = "1899-12-30"),3)),
serving = rep(c("Breakfast", "Dinner", "Snacks"), 154),
units = rep(c(1,5,49), 154)
)
View(data1[order(data1$date),])
# take mean of each subset and place it in a new data frame called data.avgs
# it should consist of 8x3 data frame; rows (column1) are "August","September", "October", "November", "December", "January","February", "March".
# columns should be "Breakfast", "Dinner", "Snack"
month.index = c(8:12, 1)
serving.index = c("Breakfast", "Dinner", "Snack")
# create the data frame with the means using placeholder data
data.avg = data.frame(months = c(month.name[8:12], month.name[1]),
bf.avg = c(1:6),
dinner.avg = c(1:6),
snack.avg = c(1:6))
# now start replacing; find the mean of the subset of the original data frame.
# find the mean of all dates that are for August, and whose serving type are for Breakfast.
for(j in 1:6){
  for(i in month.index){
    for(v in 2:4){
      data.avg[j,v] = mean(
        subset(data1,
               months(data1$date) == month.name[i] & data1$serving == serving.index[v])$units
      )
    }
  }
}
When I run the mean without the loop, for example, this;
mean(subset(data1,
months(data1$date) == "September" & data1$serving == "Breakfast")$unit)
I get the correct mean. Because of this, I am thinking that my issue may lie in the index setup.
Any and all help would be greatly appreciated,
Thanks
edit; fixed the above code. The resulting data frame is the following;
months bf.avg dinner.avg snack.avg
1 August 5 49 NaN
2 September 5 49 NaN
3 October 5 49 NaN
4 November 5 49 NaN
5 December 5 49 NaN
6 January 5 49 NaN
Here is what I am looking for;
mean(subset(data1,
+ months(data1$date) == "September" & data1$serving == "Breakfast")$unit)
[1] 1
> mean(subset(data1,
+ months(data1$date) == "September" & data1$serving == "Dinner")$unit)
[1] 5
> mean(subset(data1,
+ months(data1$date) == "September" & data1$serving == "Snacks")$unit)
[1] 49
My understanding is that these should be the values in data.avg[2, 2:4] (the September row).
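You set "Snack" in your serving.index, but you have "Snacks" in data1. In addition, v runs over 2:4 while serving.index has only three elements, so serving.index[4] is NA: the last column ends up NaN and the other means land one column too far to the left.
Fix the spelling, loop v over 1:3, and then use this code in the for loop: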
data.avg[j,v+1] = mean(
subset(data1,months(data1$date) == month.name[i] & as.character(data1$serving) == serving.index[v])$units)
data.avg
months bf.avg dinner.avg snack.avg
1 August 1 5 49
2 September 1 5 49
3 October 1 5 49
4 November 1 5 49
5 December 1 5 49
6 January 1 5 49
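For reference, here is a self-contained sketch of the corrected loop. It also replaces the separate j and i loops with a single index, because in the original nesting the i loop overwrites row j for every month, so each row ends up holding the value for the last month. The numeric month column and the tapply alternative are my own additions, not part of the answer above; the numeric month sidesteps the locale-dependent names returned by months().

```r
# Rebuild the example data from the question.
data1 <- data.frame(
  date    = sort(rep(as.Date(42948:43101, origin = "1899-12-30"), 3)),
  serving = rep(c("Breakfast", "Dinner", "Snacks"), 154),
  units   = rep(c(1, 5, 49), 154)
)

month.index   <- c(8:12, 1)
serving.index <- c("Breakfast", "Dinner", "Snacks")

# Numeric month (1-12), independent of the session locale.
data1$month <- as.integer(format(data1$date, "%m"))

# Result frame: one row per month, one column per serving type.
data.avg <- data.frame(months = month.name[month.index],
                       bf.avg = NA_real_, dinner.avg = NA_real_, snack.avg = NA_real_)

for (j in seq_along(month.index)) {
  for (v in seq_along(serving.index)) {
    rows <- data1$month == month.index[j] & data1$serving == serving.index[v]
    data.avg[j, v + 1] <- mean(data1$units[rows])  # column 1 holds the month name
  }
}

# Loop-free alternative: a month-by-serving matrix of means.
avg.table <- tapply(data1$units, list(data1$month, data1$serving), mean)
```

The tapply result is a matrix whose rows are ordered by month number rather than by the August-to-January order used in data.avg.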

Add a column to a database with matching values from another database r

Sorry that my question is a little vague. I have two separate data frames, data1 and data2, as follows:
Data 1:
Area Yr AllRev Totalcalls
A 2012 1021597.78 835
B 2013 1002968.21 833
c 2014 730345.93 65
d 2015 251956.26 232
e 2012 22408.71 25
...
Data 2:
Yr TotRev TotCalls
2012 160038596.0 131064
2013 399750664.0 312651
...
Now I want to add a column "RevPercent" to data 1 which is going to calculate the following value for each row:
100*data1$AllRev/data2$TotRev
However, if Yr == 2012 in data1, I want it to use the TotRev for 2012 from data2 to calculate the aforementioned value, and likewise for the other years. I wrote the following line of code, but I get an error:
data1 <- cbind(data1,100*round(data1[,3]/data2[data2[,1]==data2[,2],2],4))
And the message is as follows:
In data2[, 1] == data2[,2] :
longer object length is not a multiple of shorter object length
Any help is appreciated.
Thanks
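One possible approach, sketched here with only the rows shown above, is to look up each row's year in data2 with match() and divide by the matching TotRev. This is an illustration rather than a tested solution for the full data; years that do not appear in data2 (2014 and 2015 in this excerpt) come out as NA.

```r
# Only the rows visible in the question; the real data has more.
data1 <- data.frame(
  Area       = c("A", "B", "c", "d", "e"),
  Yr         = c(2012, 2013, 2014, 2015, 2012),
  AllRev     = c(1021597.78, 1002968.21, 730345.93, 251956.26, 22408.71),
  Totalcalls = c(835, 833, 65, 232, 25)
)
data2 <- data.frame(
  Yr       = c(2012, 2013),
  TotRev   = c(160038596.0, 399750664.0),
  TotCalls = c(131064, 312651)
)

# For each row of data1, the position of its year in data2 (NA if absent).
idx <- match(data1$Yr, data2$Yr)

# Divide each revenue by the total revenue of the same year.
data1$RevPercent <- round(100 * data1$AllRev / data2$TotRev[idx], 4)
print(data1)
```

merge(data1, data2, by = "Yr") followed by the same division would be an alternative; match() has the advantage of keeping the original row order of data1.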

Subsetting odd rows in r using seq

I hope this is not too much of a newbie question.
I am trying to subset rows from the GDP UK dataset that can be downloaded from here:
http://www.ons.gov.uk/ons/site-information/using-the-website/time-series/index.html
The dataframe looks more or less like that:
X ABMI
1 1948 283297
2 1949 293855
3 1950 304395
....
300 2013 Q2 381318
301 2013 Q3 384533
302 2013 Q4 387138
303 2014 Q1 390235
The thing is that for my analysis I only need the data for years 2004-2013 and I am interested in one result per year, so I wanted to get every fourth row from the dataset that lies between the 263 and 303 row.
On the basis of the following websites:
https://stat.ethz.ch/pipermail/r-help/2008-June/165634.html
(plus a few that I cannot quote due to the link limit)
I tried the following, each time getting some error message:
> GDPUKodd <- seq(GDPUKsubset[263:302,], by = 4)
Error in seq.default(GDPUKsubset[263:302, ], by = 4) :
argument 'from' must have length 1
> OddGDPUK <- GDPUK[seq(263, 302, by = 4)]
Error in `[.data.frame`(GDPUK, seq(263, 302, by = 4)) :
undefined columns selected
> OddGDPUKprim <- GDPUK[seq(263:302), by = 4]
Error in `[.data.frame`(GDPUK, seq(263:302), by = 4) :
unused argument (by = 4)
> OddGDPUK <- GDPUK[seq(from=263, to=302, by = 4)]
Error in `[.data.frame`(GDPUK, seq(from = 263, to = 302, by = 4)) :
undefined columns selected
> OddGDPUK <- GDPUK[seq(from=GDPUK[263,] to=GDPUK[302,] by = 4)]
Error: unexpected symbol in "OddGDPUK <- GDPUK[seq(from=GDPUK[263,] to"
> GDPUK[seq(1,nrows(GDPUK),by=4),]
Error in seq.default(1, nrows(GDPUK), by = 4) :
could not find function "nrows"
To put a long story short: help!
Instead of trying to extract data based on row ids, you can use the subset function with appropriate filters based on the values.
For example if your data frame has a year column with values 1948...2014 and a quarter column with values Q1..Q4, then you can get the right subset with:
subset(data, year >= 2004 & year <= 2013 & quarter == 'Q1')
UPDATE
I see your source data is dirty, with no proper year and quarter columns. You can clean it like this:
x <- read.csv('http://www.ons.gov.uk/ons/datasets-and-tables/downloads/csv.csv?dataset=pgdp&cdid=ABMI')
x$ABMI <- as.numeric(as.character(x$ABMI))
x$year <- as.numeric(gsub('[^0-9].*', '', x$X))
x$quarter <- gsub('[0-9]{4} (Q[1-4])', '\\1', x$X)
subset(x, year >= 2004 & year <= 2013 & quarter == 'Q1')
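Your code GDPUK[seq(1,nrows(GDPUK),by=4),] actually works quite well for these purposes. The only thing you need to change is nrows to nrow. Your attempts with seq(263, 302, by = 4) fail with "undefined columns selected" because the index has no trailing comma, so it is treated as a column selection.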
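As a minimal sketch of the row-index approach, assuming a toy data frame in place of the real GDPUK (the contents below are placeholders, only the row count of 303 follows the question):

```r
# Toy stand-in for GDPUK with 303 rows; ABMI simply holds the row number.
GDPUK <- data.frame(X = paste("row", 1:303), ABMI = 1:303)

# Every fourth row between rows 263 and 302; note the trailing comma.
rows <- seq(from = 263, to = 302, by = 4)
OddGDPUK <- GDPUK[rows, ]

# Every fourth row of the whole data frame, using nrow() rather than nrows().
every_fourth <- GDPUK[seq(1, nrow(GDPUK), by = 4), ]
```

Selecting by position relies on the quarterly rows being complete and in order; filtering on year and quarter values, as in the other answer, does not depend on that.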

select only rows that have the same id in r

ID Julian Month Year Location Distance
2 40749 July 2011 8300 39625
2 41425 May 2013 Hatchery 31325
3 40749 July 2011 6950 38625
3 41057 May 2012 Hatchery 31325
6 40735 July 2011 8300 39650
12 40743 July 2011 11025 42350
Above is the head() for the data frame I'm working with. It contains over 7,000 rows and 3,000 unique ID values. I want to delete all the rows whose ID value occurs only once. Is this possible? Maybe the solution is to keep only the rows where the ID is repeated?
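If d is your data frame, I'd use duplicated to find the rows that have recurring IDs. duplicated alone skips the first occurrence of each ID, so combine a forward pass (fromLast = FALSE) with a backward pass (fromLast = TRUE) to flag every row whose ID occurs more than once.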
d[(duplicated(d$ID, fromLast = FALSE) | duplicated(d$ID, fromLast = TRUE)),]
This double-duplicated method has a variety of uses:
Finding ALL duplicate rows, including "elements with smaller subscripts"
How to get a subset of a dataframe which only has elements which appear in the set more than once in R
How to identify "similar" rows in R?
Here is how I would do it:
new.dataframe <- c()
ids <- unique(dataframe$ID)
for(id in ids){
  temp <- dataframe[dataframe$ID == id, ]
  if(nrow(temp) > 1){
    new.dataframe <- rbind(new.dataframe, temp)
  }
}
This will remove all the IDs that only have one row
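Both ideas can be checked on the six rows shown in the question. The sketch below keeps only the ID and Julian columns for brevity; the ave() variant is an additional loop-free option of mine, not taken from either answer.

```r
# The six rows from the question (ID and Julian columns only).
d <- data.frame(
  ID     = c(2, 2, 3, 3, 6, 12),
  Julian = c(40749, 41425, 40749, 41057, 40735, 40743)
)

# Forward and backward duplicated() passes flag every row with a repeated ID.
keep_dup <- d[duplicated(d$ID) | duplicated(d$ID, fromLast = TRUE), ]

# Alternative: count the rows per ID and keep the IDs that occur more than once.
keep_ave <- d[ave(d$ID, d$ID, FUN = length) > 1, ]

print(keep_dup)
```

On a data frame with thousands of IDs, either vectorised form should be noticeably faster than growing a data frame with rbind inside a loop.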
