Creating a subset in R using a double loop [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I have a very large CSV file that I have imported into R, and I need to make a subset of the data.
The csv looks something like this:
Julian_Day Id Year
52 1 1901
56 5 1901
200 1 1968
etc., where Year is 1901-2010, Id is 1-58, and Julian_Day is 1-200, for about 130,000 rows of data. I want to keep only the lowest Julian_Day value for each Id in each year and get rid of all other rows of data.

Data:
df <- data.frame(Year = c(1901, 1901, 1968, 1901),
                 Id = c(1, 5, 1, 1),
                 Julian_Day = c(52, 56, 200, 40),
                 Animal = c('dog', 'doggy', 'style', 'fashion'))
Try this:
library(data.table)
setDT(df)[, min := min(Julian_Day), by = list(Id, Year)]
# > df
# Year Id Julian_Day Animal min
#1: 1901 1 52 dog 40
#2: 1901 5 56 doggy 56
#3: 1968 1 200 style 200
#4: 1901 1 40 fashion 40
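Note that this keeps every row and only adds a min column. If you actually want to drop the non-minimum rows, as the question asks, one possible data.table sketch (assuming the same df as above; which.min keeps the first row on ties):
# Keep only the row with the smallest Julian_Day per (Id, Year)
setDT(df)[, .SD[which.min(Julian_Day)], by = list(Id, Year)]
#    Id Year Julian_Day  Animal
#1:  1 1901         40 fashion
#2:  5 1901         56   doggy
#3:  1 1968        200   style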

Or simply with base R:
aggregate(Julian_Day ~ Year + Id, df, min)
# Year Id Julian_Day
# 1 1901 1 40
# 2 1968 1 200
# 3 1901 5 56
Or
library(dplyr)
df %>%
  group_by(Id, Year) %>%
  summarise(Julian_Day = min(Julian_Day))
# Source: local data frame [3 x 3]
# Groups: Id
#
# Id Year Julian_Day
# 1 1 1901 40
# 2 1 1968 200
# 3 5 1901 56
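Note that summarise() drops the Animal column. If you need the full rows, a sketch that instead filters each group down to its minimum (ties would keep multiple rows):
df %>%
  group_by(Id, Year) %>%
  filter(Julian_Day == min(Julian_Day)) %>%
  ungroup()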

Related

r - Fill in missing years in Data frame [duplicate]

This question already has answers here:
Extend an irregular sequence and add zeros to missing values
(9 answers)
Closed 1 year ago.
I have some data in R that looks like this.
year freq
<int> <int>
1902 2
1903 2
1905 1
1906 4
1907 1
1908 1
1909 1
1912 1
1914 1
1915 1
The data was read in using the following code.
data = read.csv("earthquakes.csv")
my_var <- c('year')
new_data <- data[my_var]
counts <- count(data, 'year')
This is one page of a seven-page table. I need to fill in the missing years from 1900-1999 with a count of 0. How would I go about this? I haven't been able to find an example online where year is the primary column.
We may use complete on the 'counts' data
library(tidyr)
complete(counts, year = 1900:1999, fill = list(freq = 0))
1) Convert the input, shown in the Note, to zoo class and then to ts class. The latter will fill in the missing years with NA. Replace the NAs with 0, convert back to a data frame, and set the names to the original names.
If a ts series is ok as output then omit the last two lines. If in addition it is ok to use NA rather than 0 then omit the last three lines.
library(zoo)
DF |>
read.zoo() |>
as.ts() |>
na.fill(0) |>
fortify.zoo() |>
setNames(names(DF))
giving:
year freq
1 1902 2
2 1903 2
3 1904 0
4 1905 1
5 1906 4
6 1907 1
7 1908 1
8 1909 1
9 1910 0
10 1911 0
11 1912 1
12 1913 0
13 1914 1
14 1915 1
2) For a base R solution, use merge. Omit the last line if NA is OK instead of 0.
m <- merge(DF, data.frame(year = min(DF$year):max(DF$year)), all = TRUE)
transform(m, freq = replace(freq, is.na(freq), 0))
Note
Lines <- "year freq
1902 2
1903 2
1905 1
1906 4
1907 1
1908 1
1909 1
1912 1
1914 1
1915 1"
DF <- read.table(text = Lines, header = TRUE)

Transpose column and group dataframe [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 5 years ago.
I'm trying to change a dataframe in R to group multiple rows by a measurement. The table has a location (km), a size (mm), a count of things in that size bin, a site, and a year. I want to take the sizes, make a column from each one (2, 4 and 6 in this example), and place the corresponding count into the row for that location, site, and year.
It seems like a combination of transposing and grouping, but I can't figure out a way to accomplish this in R. I've looked at t(), dcast() and aggregate(), but those aren't really close at all.
So I would go from something like this:
df <- data.frame(km = c(rep(32, 3), rep(50, 3)),
                 mm = rep(c(2, 4, 6), 2),
                 count = sample(1:25, 6),
                 site = rep("A", 6),
                 year = rep(2013, 6))
km mm count site year
1 32 2 18 A 2013
2 32 4 2 A 2013
3 32 6 12 A 2013
4 50 2 3 A 2013
5 50 4 17 A 2013
6 50 6 21 A 2013
To this:
km site year mm_2 mm_4 mm_6
1 32 A 2013 18 2 12
2 50 A 2013 3 17 21
Edit: I tried the solution in the suggested duplicate, but it did not work for me; I'm not really sure why. The answer below worked better.
As suggested in the comment above, we can use the sep argument in spread:
library(tidyr)
spread(df, mm, count, sep = "_")
km site year mm_2 mm_4 mm_6
1 32 A 2013 4 20 1
2 50 A 2013 15 14 22
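In current tidyr (1.0.0 and later), spread() is superseded by pivot_wider(); a sketch of the equivalent call:
library(tidyr)
# names_from gives the column whose values become new column names,
# values_from the column that fills them; names_prefix adds "mm_"
pivot_wider(df, names_from = mm, values_from = count, names_prefix = "mm_")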
As you mentioned dcast(), here is a method using it.
set.seed(1)
df <- data.frame(km=c(rep(32,3),rep(50,3)),
mm=rep(c(2,4,6),2),
count=sample(1:25,6),
site=rep("A", 6),
year=rep(2013, 6))
library(reshape2)
dcast(df, ... ~ mm, value.var="count")
# km site year 2 4 6
# 1 32 A 2013 13 10 20
# 2 50 A 2013 3 17 1
And if you want a bit of a challenge you can try the base function reshape().
df2 <- reshape(df, v.names="count", idvar="km", timevar="mm", ids="mm", direction="wide")
colnames(df2) <- sub("count.", "mm_", colnames(df2))
df2
# km site year mm_2 mm_4 mm_6
# 1 32 A 2013 13 10 20
# 4 50 A 2013 3 17 1

Aggregate a column by defined weeks in R [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 5 years ago.
I have the following table and I need to aggregate columns 4 and 5 (acres_purchase and Date_Diff) based on the weeks defined below for a given month.
For example, for any given month, my weekly definition for the purchase_Date column is as follows:
wk1: days 1-6
wk2: days 7-12
wk3: days 13-18
wk4: days 19-24
wk5: days 25-31
Year County purchase_Date acres_purchase Date_Diff
2010 Cache 9/28/2009 30.5 1
2010 Cache 10/1/2009 5.0 4
2010 Cache 10/3/2009 10.2 3
2010 Cache 10/5/2009 20 3
2010 Cache 10/7/2009 15 5
2010 Cache 10/13/2009 5 1
2010 Cache 10/14/2009 6 2
2010 Cache 10/19/2009 25 7
2010 Cache 10/25/2009 12 3
2010 Cache 10/30/2009 2 1
Output:
Year County purchase_Date Week purchase_by_date Date_Diff
2010 Cache 9/28/2009 Sep-wk5 30.5 1
2010 Cache 10/1/2009 Oct-wk1 35.2 10
2010 Cache 10/7/2009 Oct-wk2 15 5
2010 Cache 10/13/2009 Oct-wk3 11 3
2010 Cache 10/19/2009 Oct-wk4 25 7
2010 Cache 10/25/2009 Oct-wk5 14 4
Is there a way that I can achieve "output" table in R?
Any help is appreciated.
First convert purchase_Date to a date class, then extract purchase_Day:
df1$purchase_Date <- as.Date(df1$purchase_Date, format= "%m/%d/%Y")
df1$purchase_Day <- as.numeric(format(df1$purchase_Date, "%d"))
Define a helper function to assign each range of days to the correct week:
weekGroup <- function(x) {
  if (x <= 6) {
    week <- "wk1"
  } else if (x <= 12) {
    week <- "wk2"
  } else if (x <= 18) {
    week <- "wk3"
  } else if (x <= 24) {
    week <- "wk4"
  } else {
    week <- "wk5"
  }
  return(week)
}
Pass each day to our helper function:
df1$week <- sapply(df1$purchase_Day, weekGroup)
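As an aside, the same mapping can be done in one vectorized step, for example with cut():
# days 1-6 -> wk1, 7-12 -> wk2, ..., 25-31 -> wk5
df1$week <- cut(df1$purchase_Day, breaks = c(0, 6, 12, 18, 24, 31),
                labels = paste0("wk", 1:5))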
Pull the month into a separate column, and convert to numeric
df1$month <- as.numeric(format(df1$purchase_Date, "%m"))
month.abb is a built-in character vector of month abbreviations. Use the numeric month to index into it:
df1$monthAbb <- sapply(df1$month, function(x) month.abb[x])
Combine week and monthAbb
df1$monthWeek <- paste(df1$monthAbb,df1$week, sep="-")
And @cmaher basically provided this already, but for completeness, the final summary:
require(dplyr)
df1 %>%
  group_by(Year, County, monthWeek) %>%
  summarise(purchaseDate = min(purchase_Date),
            acres = sum(acres_purchase),
            date_diff = sum(Date_Diff))
Year County monthWeek purchaseDate acres date_diff
<int> <fctr> <chr> <date> <dbl> <int>
1 2010 Cache Oct-wk1 2009-10-01 35.2 10
2 2010 Cache Oct-wk2 2009-10-07 15.0 5
3 2010 Cache Oct-wk3 2009-10-13 11.0 3
4 2010 Cache Oct-wk4 2009-10-19 25.0 7
5 2010 Cache Oct-wk5 2009-10-25 14.0 4
6 2010 Cache Sep-wk5 2009-09-28 30.5 1
Assuming your purchase_Date variable is of class Date, you can use lubridate::day() and base::findInterval to segment your dates:
df$Week <- findInterval(lubridate::day(df$purchase_Date), c(7, 13, 19, 25, 32)) + 1
df$Week <- as.factor(paste(lubridate::month(df$purchase_Date), df$Week, sep = "-"))
# purchase_Date Week
# 2017-10-01 10-1
# 2017-10-02 10-1
# 2017-10-03 10-1
# ...
# 2017-10-29 10-5
# 2017-10-30 10-5
# 2017-10-31 10-5
Then, one way to achieve your target output is with dplyr like so:
df %>%
  group_by(Year, County, Week) %>%
  summarize(purchase_Date = min(purchase_Date),
            purchase_by_date = sum(acres_purchase),
            Date_Diff = sum(Date_Diff))

From panel data to cross-sectional data using averages

I am very new to R so I am not sure how basic my question is, but I am stuck at the following point.
I have data that has a panel structure, similar to this
Country Year Outcome Country-characteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60
For some reason I need to put this into a cross-sectional structure such that I get averages over all years for each country. In the end, it should look like this:
Country Outcome Country-Characteristic
A 12 40
B 11 60
Has anybody faced a similar problem? I was playing with lapply(table$country, table$outcome, mean), but that did not work as I wanted.
Two tips: 1) when you ask a question, you should provide a reproducible example of the data (as I did with read.table below); 2) it's not a good idea to use "-" in column names; use "_" instead.
You can get a summary using the dplyr package:
df1 <- read.table(text="Country Year Outcome Countrycharacteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60", header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
  group_by(Country) %>%
  summarize(Outcome = mean(Outcome),
            Countrycharacteristic = mean(Countrycharacteristic))
# A tibble: 2 x 3
Country Outcome Countrycharacteristic
<chr> <dbl> <dbl>
1 A 12 40
2 B 11 60
We can do this in base R with aggregate:
aggregate(.~Country, df1[-2], mean)
# Country Outcome Countrycharacteristic
#1 A 12 40
#2 B 11 60
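For completeness, a data.table sketch of the same collapse, assuming df1 as defined above:
library(data.table)
# Average each measurement column within each Country
setDT(df1)[, lapply(.SD, mean), by = Country,
           .SDcols = c("Outcome", "Countrycharacteristic")]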

How to compute the daily average from hourly values?

I have a text file consisting of 6 columns, as shown below. The measurements are taken every 30 minutes over several years (2001-2013). I want to compute the daily average: for example, for 2001, take all values corresponding to the first day (1) and compute the average, then do this for all days in that year and for all years available in the text file.
To read the file:
LR=read.table("C:\\Users\\dat.txt", sep ='', header =TRUE)
Header:
head(LR)
Year day hour mint valu1 valu2
1 2001 1 5 30 0 0
2 2001 1 6 0 1 0
3 2001 1 6 30 2 0
4 2001 1 7 0 0 7
5 2001 1 7 30 5 8
6 2001 1 8 0 0 0
Try:
library(plyr)
ddply(LR, .(Year, day), summarize, val = mean(valu1))
And another less elegant option:
LR$n <- paste(LR$Year, LR$day, sep="-")
tapply(LR$valu1, LR$n, FUN=mean)
If you want to select a certain range of years use subset:
dat <- ddply(LR, .(Year, day), summarize, val = mean(valu1))
subset(dat, Year > 2003 & Year < 2005)
You can try aggregate:
res <- aggregate(LR, by = list(paste0(LR$Year, LR$day)), FUN = mean)
## You can remove the extra columns if you want
res[, -c(1,4,5)]
Or as Michael Lawrence suggests, using the formula interface:
aggregate(cbind(valu1, valu2) ~ Year + day, LR, mean)
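For completeness, a dplyr sketch of the same daily average, assuming LR as read above:
library(dplyr)
# One row per (Year, day), averaging both value columns
LR %>%
  group_by(Year, day) %>%
  summarise(valu1 = mean(valu1), valu2 = mean(valu2))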
