I'm having some trouble when I try to merge two data frames. Here is an example:
Number <- c("1", "2", "3")
Letter <- factor(c("a", "b", "c"))
map <- data.frame(Number, Letter, row.names = c("Belgium", "Italy", "Senegal"))
This is my first data frame, called "map". It looks like this:
Number Letter
Belgium 1 a
Italy 2 b
Senegal 3 c
And if I try to select by row and column I don't have any problem:
map["Belgium", "Number"]
[1] "1"
Here I have my second data frame called "calendar":
Month <- c("January", "February", "March")
calendar <- data.frame(Month, row.names = c("Belgium", "Italy", "Senegal"))
It looks like this:
Month
Belgium January
Italy February
Senegal March
The problem comes when I try to merge both data frames:
map.amp = merge(map, calendar, by = 0)
Row.names Number Letter Month
1 Belgium 1 a January
2 Italy 2 b February
3 Senegal 3 c March
Now, when I try to select a cell using rows and columns, the outcome is always NA
map.amp["Italy", "Month"]
[1] NA
map.amp["Belgium", "Number"]
[1] NA
How can I merge both data frames so I can keep using that kind of select function?
You have to re-set the row names:
row.names(map.amp) <- map.amp$Row.names
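A minimal sketch of the full fix (output shown assuming R >= 4.0, where strings are not converted to factors): redo the merge, set the row names from the Row.names column, drop that now-redundant column, and the original lookups work again.
map.amp <- merge(map, calendar, by = 0)
row.names(map.amp) <- map.amp$Row.names
map.amp$Row.names <- NULL
map.amp["Belgium", "Number"]
# [1] "1"
map.amp["Italy", "Month"]
# [1] "February"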
If you want to keep using those row names you have to set the Row.names column back to row names. tibble::column_to_rownames is a nice option for this:
library(dplyr)  # or library(magrittr); needed for the %>% pipe
map.amp <- merge(map, calendar, by = 0) %>% tibble::column_to_rownames(var = "Row.names")
map.amp[map.amp$Row.names == 'Italy', 'Month']
This will work now, since Row.names is also an ordinary column after the merge.
You could use the answer in the comment by @thelatemail. Or use
subset(map.amp, Row.names == 'Italy')[['Month']] # first get the matching rows, then narrow to the named column
or
subset(map.amp, Row.names =='Italy', 'Month') # third argument is for column selection
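For example, with the merged map.amp from the question (where Row.names is still an ordinary column), either form should return the Italy entry (output shown assuming R >= 4.0):
subset(map.amp, Row.names == 'Italy')[['Month']]
# [1] "February"
subset(map.amp, Row.names == 'Italy', 'Month')
#      Month
# 2 February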
I have a data frame of postcodes with a regional/metro classification assigned. In some instances, due to the datasource, the same postcode will occur with both a regional and metro classification.
POSTCODE REGON
1 3000 METRO
2 3000 REGIONAL
3 3256 METRO
4 3145 METRO
I am wondering how to remove the duplicate row and replace the region with "SPLIT" in these instances.
I have tried the code below; however, this reassigns the entire dataset with either "METRO" or "REGIONAL":
test <- within(PC_ACTM, REGION <- ifelse(duplicated("Postcode"), "SPLIT", REGION))
The desired output would be
POSTCODE REGON
1 3000 SPLIT
2 3256 METRO
3 3145 METRO
Example data:
dput(PC_ACTM)
structure(list(POSTCODE = c(3000L, 3000L, 3256L, 3145L), REGON = c("METRO",
"REGIONAL", "METRO", "METRO")), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Based on your title, you're looking for an ifelse() solution; perhaps this will suit?
PC_ACTM <- structure(list(POSTCODE = c(3000L, 3000L, 3256L, 3145L),
REGION = c("METRO", "REGIONAL", "METRO", "METRO")),
class = "data.frame",
row.names = c("1", "2", "3", "4"))
PC_ACTM$REGION <- ifelse(duplicated(PC_ACTM$POSTCODE), "SPLIT", PC_ACTM$REGION)
PC_ACTM[!duplicated(PC_ACTM$POSTCODE, fromLast = TRUE),]
#> POSTCODE REGION
#> 2 3000 SPLIT
#> 3 3256 METRO
#> 4 3145 METRO
Created on 2022-04-07 by the reprex package (v2.0.1)
Consider ave to sequentially count by group and then subset the last row of each group, but before that use ifelse to replace the value for any group whose count is over 1. Below uses the new base R 4.1.0+ pipe |>:
test <- within(
PC_ACTM, {
PC_SEQ <- ave(1:nrow(PC_ACTM), POSTCODE, FUN=seq_along)
PC_COUNT <- ave(1:nrow(PC_ACTM), POSTCODE, FUN=length)
REGION <- ifelse(
(PC_SEQ == PC_COUNT) & (PC_COUNT > 1), "SPLIT", REGION
)
}
) |> subset(
subset = PC_SEQ == PC_COUNT, # SUBSET ROWS
select = c(POSTCODE, REGION) # SELECT COLUMNS
) |> `row.names<-`(NULL) # RESET ROW NAMES
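Assuming PC_ACTM is defined as in the previous answer (with a REGION column rather than the REGON typo from the dput), the result should match the desired output from the question:
test
#>   POSTCODE REGION
#> 1     3000  SPLIT
#> 2     3256  METRO
#> 3     3145  METRO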
I have a very untidy data set, something like this:
A tibble: 200000 x 2
ChatData
<chr>
1 Sep 30, 2018 7:12pm
2 Person A
3 Hello
4 Sep 30, 2018 7:11pm
5 Person B
6 Hello there
7 Sep 30, 2018 7:10pm
8 Person A
...
As you can see it goes date, person name, comment, and repeats.
I am working on the problem and have a very complex method that adds a score column depending on the names, etc.
I would like to transform this into something like this
Person A , Person B
Hello NA
NA Hello there
how's you, NA
...
(The date as a row name or third column would be great but not essential to the question)
Ideally I am looking for a dplyr/tidyverse solution.
I am working with lots of data, so no slow for loops, etc.
Raw data to work with:
structure(list(ChatData = c("Sep 30, 2018 7:12pm", "Person A", "Hello", "Sep 30, 2018 7:11pm", "Person B", "Hello there")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
If anyone is wondering I am analysing facebook messenger data, and this is the form it comes in when you download it.
Thank you.
Your starting data set has only one column (aka feature), but there are three types of data encoded about each message: a timestamp, the label of the person, and the message itself. It will be more useful to transform these into a table where each message is in its own row and each column represents a different aspect of each observation, i.e. in long, or "tidy", format: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
In the approach below, the user first defines what features are repeated in the data set. I call them "headers" here, since I'm working toward a table where these are the column headers. Then the script adds that information to the data and converts the single-column data into a tidy format with one row per message, and one aspect of each message in each column.
Your requested output is a minor variation of this, addressed in the last line below, %>% spread(person, msg), which splits the Person A and Person B data into separate columns.
library(tidyverse)
header_names <- c("timestamp", "person", "msg")
rows_per <- length(header_names)
data_length <- length(data$ChatData) / rows_per
# `data` is assumed to hold the one-column tibble from the question
data2 <- data %>%
mutate(msg_number = rep(1:(nrow(data)/rows_per), each=rows_per),
# This line repeats the header_names sequence for each msg
header = rep(header_names, data_length)) %>%
spread(header, ChatData) %>%
mutate(timestamp = lubridate::mdy_hm(timestamp)) %>%
spread(person, msg)
head(data2)
# A tibble: 2 x 4
msg_number timestamp `Person A` `Person B`
<int> <dttm> <chr> <chr>
1 1 2018-09-30 19:12:00 Hello NA
2 2 2018-09-30 19:11:00 NA Hello there
As you basically just have a character vector that you would like to convert into a 3-column data.frame, one other option is to simply use matrix and specify ncol=3 and byrow=TRUE:
# your sample data
d <- structure(list(ChatData = c("Sep 30, 2018 7:12pm", "Person A", "Hello", "Sep 30, 2018 7:11pm", "Person B", "Hello there")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
matrix( d$ChatData, ncol=3, byrow=TRUE,
dimnames=list( NULL, c("date_time", "person", "message")) )
Result is a character matrix:
date_time person message
[1,] "Sep 30, 2018 7:12pm" "Person A" "Hello"
[2,] "Sep 30, 2018 7:11pm" "Person B" "Hello there"
But you can wrap that in as.data.frame() to convert to a data.frame and continue working from there with dplyr if that's what you want.
Putting it together for a whole solution, it becomes a nice, short, readable bit of code IMO:
library(dplyr)
library(lubridate)
result_df <-
matrix( d$ChatData, ncol=3, byrow=TRUE,
dimnames=list(NULL, c("date_time", "person", "message")) ) %>%
as.data.frame() %>%
mutate(date_time=lubridate::mdy_hm(date_time))
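The result should then be a two-row data frame with a parsed date_time column, roughly:
result_df
#             date_time   person     message
# 1 2018-09-30 19:12:00 Person A       Hello
# 2 2018-09-30 19:11:00 Person B Hello there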
Here is one approach:
data %>% group_by(msg_number = rep(1:(nrow(data)/3), each=3)) %>%
summarize(msg_data = list(ChatData)) %>% as.data.frame
msg_number msg_data
1 1 Sep 30, 2018 7:12pm, Person A, Hello
2 2 Sep 30, 2018 7:11pm, Person B, Hello there
This numbers each message and puts the data into a column list.
I have two columns; both are of character data type.
One column has plain strings and the other has strings wrapped in quotes.
I want to compare both columns and find the number of distinct names across the data frame.
string f.string.name
john NA
bravo NA
NA "john"
NA "hulk"
Here the count should be 2, as john is common.
Somehow I am not able to remove the quotes from the second column. Not sure why.
Thanks
The main problem I'm seeing are the NA values.
First, let's get rid of the quotes you mention.
dat$f.string.name <- gsub('["]', '', dat$f.string.name)
Now, count the number of distinct values.
i1 <- complete.cases(dat$string)
i2 <- complete.cases(dat$f.string.name)
sum(dat$string[i1] %in% dat$f.string.name[i2]) + sum(dat$f.string.name[i2] %in% dat$string[i1])
DATA
dat <-
structure(list(string = c("john", "bravo", NA, NA), f.string.name = c(NA,
NA, "\"john\"", "\"hulk\"")), .Names = c("string", "f.string.name"
), class = "data.frame", row.names = c(NA, -4L))
library(stringr)
table(str_replace_all(unlist(dat), '["]', ''))
# bravo hulk john
# 1 1 2
I need some help to re-design the output of a function that comes from an R package.
My goal is to reshape a data frame called output_IMFData so that it looks very similar to the shape of output_imfr. The code for an MWE reproducing these data frames is:
library(imfr)
output_imfr <- imf_data(database_id="IFS", indicator="IAD_BP6_USD", country = "", start = 2010, end = 2014, freq = "A", return_raw =FALSE, print_url = T, times = 3)
and for output_IMFData
library(IMFData)
databaseID <- "IFS"
startdate <- "2010"
enddate <- "2014"
checkquery <- FALSE
queryfilter <- list(CL_FREA = "A", CL_AREA_IFS = "", CL_INDICATOR_IFS = "IAD_BP6_USD")
output_IMFData <- CompactDataMethod(databaseID, queryfilter, startdate, enddate,
checkquery)
The output from output_IMFData looks like this:
But I want to redesign this data frame to look like the output of output_imfr:
Sadly, I am not that advanced a user and could not find something that can help me. My basic problem in converting the shape of output_IMFData to the shape of the second "panel-data-looking" data frame is that I don't know how to handle the Obs column in output_IMFData without losing the correspondence with the reference codes in #REF_AREA. That is, the #REF_AREA column holds country codes and the Obs column holds their respective time series data. This is a very cumbersome way of working with panel data, and therefore I want to reshape that data frame into the much nicer form of the output_imfr data frame.
The data of interest are stored in a list in the column Obs. Here is a dplyr solution to split the data, crack open the list, then stitch things back together.
library(dplyr)  # provides %>% and bind_rows() used below
longData <-
output_IMFData %>%
split(1:nrow(.)) %>%
lapply(function(x){
data.frame(
iso2c = x[["#REF_AREA"]]
, x$Obs
)
}) %>%
bind_rows()
head(longData)
gives:
iso2c X.TIME_PERIOD X.OBS_VALUE X.OBS_STATUS
1 FJ 2010 47.2107721901621 <NA>
2 FJ 2011 48.28347 <NA>
3 FJ 2012 51.0823499999999 <NA>
4 FJ 2013 157.015648875072 <NA>
5 FJ 2014 186.623232882226 <NA>
6 AW 2010 616.664804469274 <NA>
Here's another approach:
NewDataFrame <- data.frame(iso2c=character(),
year=numeric(),
IAD_BP6_USD=character(),
stringsAsFactors=FALSE)
newrow = 1
for(i in 1:nrow(output_IMFData)) { # for each row of your cludgy df
for(j in 1:length(output_IMFData$Obs[[i]]$`#TIME_PERIOD`)) { # for each year
NewDataFrame[newrow,'iso2c']<-output_IMFData[i, '#REF_AREA']
NewDataFrame[newrow,'year']<-output_IMFData$Obs[[i]]$`#TIME_PERIOD`[j]
NewDataFrame[newrow,'IAD_BP6_USD']<-output_IMFData$Obs[[i]]$`#OBS_VALUE`[j]
newrow<-newrow + 1 # increment down a row
}
}
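For reference, the first rows should line up with the longData output from the previous answer (same source data), along the lines of:
NewDataFrame[1:3, ]
#   iso2c year      IAD_BP6_USD
# 1    FJ 2010 47.2107721901621
# 2    FJ 2011         48.28347
# 3    FJ 2012 51.0823499999999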
My sample data set looks like this:
Country Population Capital Area
A 210000210 Sydney/Landon 10000000
B 420000000 Landon 42100000
C 500000 Italy42/Rome1 9200000
D 520000100 Dubai/Vienna21A 720000
How can I delete every row with the pattern / in its Capital column? I have looked at the following link, R: Delete rows based on different values following a certain pattern, but it does not help.
You can try grepl:
df[!grepl('[/]', df$Capital),]
# Country Population Capital Area
#2 B 420000000 Landon 42100000
library(stringr)
library(tidyverse)
df2 <- df %>%
filter(!str_detect(Capital, "\\/"))
# Country Population Capital Area
# 1 B 420000000 Landon 42100000
Data
df <- structure(list(Country = c("A", "B", "C", "D"), Population = c(210000210L,420000000L, 500000L, 520000100L),
Capital = c("Sydney/Landon", "Landon", "Italy42/Rome1", "Dubai/Vienna21A"),
Area = c(10000000L, 42100000L, 9200000L, 720000L)), class = "data.frame", row.names = c(NA,-4L))