I am a beginner trying to work with R, but constantly hitting walls.
I have a giant dataset (thousands of entries) with columns for Latitude, Longitude and PlotCode.
There is more than one plot per Latitude/Longitude pair. I would like to create a new column with some sort of ID for all plots that share the same Latitude and Longitude.
Something that will look like this in the end:
Any suggestions? Thank you.
Welcome to SO! It's better to include sample data, the desired output, your attempts and so on in your question. That said, you may find a solution with the dplyr package.
After installing it, you could do this:
library(dplyr)
# some data like yours
data_latlon <- data.frame(Lat = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                          Long = c(45, 45, 45, 12, 12, 12, 23, 23, 23),
                          PlotCode = c('a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'))

data_latlon %>%            # the pipe operator chains dplyr verbs
  group_by(Lat, Long) %>%  # group by unique Lat/Long combinations
  summarise(PlotCodeGrouped = paste(PlotCode, collapse = ''))
                           # paste() collapses all plot codes of a group into one
                           # string; the collapse argument sets the separator,
                           # in this case nothing
# A tibble: 3 x 3
# Groups:   Lat [?]
    Lat  Long PlotCodeGrouped
  <dbl> <dbl> <chr>
1     1    45 aaa
2     2    12 bbb
3     3    23 ccc
EDIT
The code is even simpler if you want the result in the shape you showed:
data_latlon %>%                    # the pipe operator chains dplyr verbs
  group_by(Lat, Long, add = TRUE)  # group by unique Lat and Long; add = TRUE
                                   # keeps any existing groups and adds these
                                   # under a "hierarchical father"
# A tibble: 9 x 3
# Groups:   Lat, Long [3]
    Lat  Long PlotCode
  <dbl> <dbl> <fct>
1     1    45 a
2     1    45 a
3     1    45 a
4     2    12 b
5     2    12 b
6     2    12 b
7     3    23 c
8     3    23 c
9     3    23 c
I think I found the solution: what I needed is something called a cluster ID.
dataframe <- transform(dataframe, Cluster_ID = as.numeric(interaction(Lat, Long, drop = TRUE)))
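For comparison, a dplyr sketch of the same idea, using the data_latlon example from the first answer (cur_group_id() requires dplyr >= 1.0.0):
library(dplyr)
data_latlon %>%
  group_by(Lat, Long) %>%
  mutate(Cluster_ID = cur_group_id()) %>%  # one integer ID per Lat/Long group
  ungroup()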
By grouping, do you mean sorting/arranging them by PlotCode? If so, you can use the base sort() function, or the arrange() function from the tidyverse/dplyr package, as sketched below.
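A minimal sketch, assuming the data_latlon sample data from the accepted answer:
sort(data_latlon$PlotCode)         # base R: sorts the column as a vector
library(dplyr)
data_latlon %>% arrange(PlotCode)  # dplyr: sorts the whole data frame by PlotCode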
I want a way to fill in values for 'checks' which have no start and end times. At first I thought of using bilinear interpolation for this task, but then I realised that is overcomplicated and I just need something much simpler.
My data looks something like this:
df <- data.frame("ID" = c(A,A,A,A,A,B,B,B,B,B),
"Check"= c(1:5),
"Start_time" = c("start_a1","start_a2","start_a3","start_a4","start_a5","startb1","startb2","startb3",NA,"startb5"),
"end_time" = c("end_a1","end_a2","end_a3","end_a4","end_a5","end_b1","end_b2",NA,NA,"endb5")
)
Ideally, for any check which has a missing start time or end time, it should pick the value from the next check, not the previous one.
I am trying the following code block, but it's giving me an issue:
df$end_time[df$Check == 3 & is.na(df$end_time)] <- df$start_time[df$Check == 5]
# this gives a length issue: the left side selects one row, the right side two
Any advice would be helpful here; my dataset contains approx. 5k rows, and each ID has a number of checks with start and end times.
The tidyr package has a function fill() which does exactly this.
library(dplyr)  # for group_by()
library(tidyr)  # for fill()

df %>%
  group_by(ID) %>%
  fill(c(Start_time, end_time), .direction = 'up')
# A tibble: 10 × 4
# Groups:   ID [2]
   ID    Check Start_time end_time
   <chr> <int> <chr>      <chr>
 1 A         1 start_a1   end_a1
 2 A         2 start_a2   end_a2
 3 A         3 start_a3   end_a3
 4 A         4 start_a4   end_a4
 5 A         5 start_a5   end_a5
 6 B         1 startb1    end_b1
 7 B         2 startb2    end_b2
 8 B         3 startb3    endb5
 9 B         4 startb5    endb5
10 B         5 startb5    endb5
The .direction="up" parameter means it takes the next non-missing value to fill in blanks. To use the previous value you would use .direction="down". And using .direction="updown" would use the next value unless there are no more non-missing values in that group, then it would take the previous non-missing value. (Useful in cases where the missing value is the last row of the group.)
I am extremely new to R and programming, so I don't even know how to describe my question very clearly; excuse me for using an example to explain what I mean.
Say I have a data frame with 2 columns: the first one holds countries, the second the rate of happiness (0-10). The country column can contain lots of repeated values, e.g.:
Country:   A,  C, A, B, B, B, C, A,  D, D, ...
Happiness: 10, 9, 3, 4, 4, 5, 6, 9, 10, 6, ...
What I want to achieve is to get the mean/median/mode for countries A, B, C and D respectively. So far, using the describe() function, I can only get these statistics across all the numbers, rather than by country.
I wonder if there is a function to achieve this directly, or should I create subsets of each country first? How should I do it?
Many thanks.
You can do this best with dplyr, but first you will have to write a function for the mode:
getmode <- function(v) {
  uniqv <- unique(v[!is.na(v)])             # candidate values, with NAs dropped
  uniqv[which.max(table(match(v, uniqv)))]  # the most frequent value (first appearance wins ties)
}
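A quick sanity check on a made-up vector:
getmode(c(4, 4, 7, 9))  # returns 4; with tied counts, the value that appears first wins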
Now you can group_by() the grouping variable Country and use summarise() to calculate the statistics:
library(dplyr)
df %>%
  group_by(Country) %>%
  summarise(Mean = mean(Happiness),
            Median = median(Happiness),
            Mode = getmode(Happiness))
Result:
# A tibble: 4 x 4
  Country  Mean Median  Mode
* <chr>   <dbl>  <dbl> <int>
1 A         2.5    2.5     2
2 B         2      2       2
3 C         3      3       3
4 D         3.5    3.5     5
Data:
set.seed(12)
df <- data.frame(
  Country = sample(LETTERS[1:4], 10, replace = TRUE),
  Happiness = sample(1:5, 10, replace = TRUE)
)
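For comparison, a base-R sketch of the same summary using tapply() (no packages needed; relies on the getmode() helper defined above):
data.frame(
  Country = sort(unique(df$Country)),               # tapply returns groups in sorted order
  Mean    = tapply(df$Happiness, df$Country, mean),
  Median  = tapply(df$Happiness, df$Country, median),
  Mode    = tapply(df$Happiness, df$Country, getmode)
)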
This should be very simple, but I can't figure out how to do it properly.
Given the following example dataframe:
telar <- data.frame(name = c("uno", "dos", "tres", "cuatro", "cinco"),
                    id = c(1, 2, 3, 1, 2),
                    test = c(10, 11, 12, 13, 14))
telar
name id test
1 uno 1 10
2 dos 2 11
3 tres 3 12
4 cuatro 1 13
5 cinco 2 14
I am trying to select all the rows that, for example, have a value of test that is below the average of all the values in the dataframe telar that have the same id value.
I have already properly grouped the values by id and computed their average like this, but I do not know how to perform the comparison.
> summarise(group_by(telar, id), test = mean(test))
# A tibble: 3 x 2
     id  test
  <dbl> <dbl>
1     1  11.5
2     2  12.5
3     3  12
Thank you!
You can simply create your groups and keep the values that are less than the mean, i.e.
library(dplyr)
telar %>%
  group_by(id) %>%   # group by id only; grouping by name as well would put each
                     # row in its own group, and the filter would return nothing
  filter(test < mean(test)) %>%
  ungroup()
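For the sample data this keeps uno (10 < 11.5, the mean for id 1) and dos (11 < 12.5, the mean for id 2), while tres is dropped because 12 is not strictly below its group mean of 12 (worked out by hand):
#   name id test
# 1  uno  1   10
# 2  dos  2   11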
There is undoubtedly a way to do this without using data.table, but I offer it as a solution:
library(data.table)
setDT(telar)                                     # convert to a data.table by reference
telar[, avg := mean(test), by = id][test < avg]  # per-id mean, then keep the rows below it
Note: if you're doing further analysis on data frames after this, I recommend returning to a data.frame using setDF(telar).
Using base R, this can be done with ave
telar[with(telar, test < ave(test, id)), ]  # ave() returns the per-id mean for every row
I'm trying to transfer some work previously done in Excel into R. All I need to do is transform two basic COUNTIF formulae into readable R script. In Excel, I would use three tables and calculate across those using 'point-and-click' methods, but now I'm lost as to how I should address it in R.
My original dataframes are large, so for this question I've posted sample dataframes:
OperatorData <- data.frame(
Operator = c("A","B","C"),
Locations = c(850, 575, 2175)
)
AreaData <- data.frame(
Area = c("Torbay","Torquay","Tooting","Torrington","Taunton","Torpley"),
SumLocations = c(1000,500,500,250,600,750)
)
OperatorAreaData <- data.frame(
Operator = c("A","A","A","B","B","B","C","C","C","C","C"),
Area = c("Torbay","Tooting","Taunton",
"Torbay","Taunton","Torrington",
"Tooting","Torpley","Torquay","Torbay","Torrington"),
Locations = c(250,400,200,
100,400,75,
100,750,500,650,175)
)
What I'm trying to do is add two new columns to the OperatorData dataframe: one giving the number of areas each operator operates in, and another giving the number of areas in which that operator owns more than 50% of the locations.
So the new resulting dataframe would look like this
Operator Locations AreaCount Own_GE_50percent
A              850         3                1
B              575         3                1
C             2175         5                4
So far, I've managed to calculate the first column using the table function and then appending:
OpAreaCount <- data.frame(table(OperatorAreaData$Operator))
names(OpAreaCount)[2] <- "AreaCount"
OperatorData$AreaCount <- OpAreaCount$AreaCount  # relies on both having the same Operator order
This is fairly straightforward, but I'm stuck on how to calculate the second column with the 50% condition.
library(dplyr)

OperatorAreaData %>%
  inner_join(AreaData, by = "Area") %>%
  group_by(Operator) %>%
  summarise(AreaCount = n_distinct(Area),
            Own_GE_50percent = sum(Locations > (SumLocations / 2)))
# # A tibble: 3 x 3
# Operator AreaCount Own_GE_50percent
# <fct> <int> <int>
# 1 A 3 1
# 2 B 3 1
# 3 C 5 4
You can use AreaCount = n() if you're sure you have unique Area values for each Operator.
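Since the question asks for these as new columns on OperatorData, one way (a sketch; op_summary is a hypothetical name for the summarised tibble above) is to join the result back:
op_summary <- OperatorAreaData %>%
  inner_join(AreaData, by = "Area") %>%
  group_by(Operator) %>%
  summarise(AreaCount = n_distinct(Area),
            Own_GE_50percent = sum(Locations > (SumLocations / 2)))

OperatorData <- left_join(OperatorData, op_summary, by = "Operator")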
I have a dataset which contains duplicates of the ident variable.
I need to select only one observation per ident, and it needs to be the newest one, i.e. the resulting data should contain, for each ident, the observation with the highest 'year' in the initial data set.
I believe a general case would look like this:
ident value year
    A     1 19X1
    A     2 19X2
    B     4 19X2
    B     2 19X1
    B     1 19X3
    C     1 19X4
    C     2 19X1
Only, I have several hundred thousand observations.
Order of the resulting data set is not important to me.
Using the dplyr library you can do something like this:
library(dplyr)
df %>% group_by(ident) %>% arrange(desc(year)) %>% slice(1)
Output will be as follows:
Source: local data frame [3 x 3]
Groups: ident [3]

  ident value  year
  (chr) (int) (chr)
1     A     2  19X2
2     B     1  19X3
3     C     1  19X4
This assumes year is in a format where sorting in descending order makes it go from latest to oldest.
Try
df <- do.call(rbind, lapply(split(df, df$ident),                  # one data frame per ident
                            function(x) x[which.max(x$year), ]))  # keep the newest row
# note: which.max() needs a numeric vector; with character years like "19X3",
# use which.max(xtfrm(x$year)) to get comparable sort keys
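A newer dplyr (>= 1.0.0) alternative, as a sketch; slice_max() orders character years lexically, which matches chronology for this sample format:
library(dplyr)
df %>%
  group_by(ident) %>%
  slice_max(year, n = 1, with_ties = FALSE) %>%  # keep the single newest row per ident
  ungroup()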