How to transpose info in rows into one column in R? [duplicate] - r

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 6 years ago.
I have 150 stops (Cod), and each one is used by a number of services.
| Cod | SERVICE1 | SERVICE2 | SERVICE3 | Position
------------------------------------------------------
| P05 | XRS10 | XRS07| XRS05| 12455
| R07 | FR05 | | | 4521
| X05 | XRS07 | XRS10| | 57541
I need to put all the services (SERVICE1,SERVICE2,SERVICE3) in one column. That means that I need the following result.
| Cod | SERVICE | Position
------------------------------------------------------
| P05 | XRS10 | 12455
| P05 | XRS07 | 12455
| P05 | XRS05 | 12455
| R07 | FR05 | 4521
| X05 | XRS07 | 57541
| X05 | XRS10 | 57541
Is there any way to do this using the sqldf package in R, or any other way to do it?

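Try this:
library(magrittr) ## used for the pipe, %>%
library(dplyr) ## for filtering observations and selecting columns
library(tidyr) ## for making your dataset long/tidy
new_data <- original_data %>%
  ## gather only the service columns so that Cod and Position stay as identifiers
  tidyr::gather(key = service_type, value = SERVICE, SERVICE1:SERVICE3) %>%
  ## drop empty service slots, whether they are stored as NA or as ""
  dplyr::filter(!is.na(SERVICE), SERVICE != "") %>%
  dplyr::select(-service_type)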
Unfortunately, I am not familiar with sqldf.
Note that if you want to keep the information on whether the service comes from SERVICE1, SERVICE2, or SERVICE3, omit the last line (dplyr::select) and the %>% that precedes it.
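For readers who prefer to avoid extra packages, the same wide-to-long step can be sketched in base R. This is only an illustration, not part of the original answer: the name original_data and the sample values are taken from the question, while the helper name service_cols and the choice to treat both NA and "" as "no service" are assumptions.

```r
# Sample data copied from the question; empty slots are stored as NA here (an assumption)
original_data <- data.frame(
  Cod = c("P05", "R07", "X05"),
  SERVICE1 = c("XRS10", "FR05", "XRS07"),
  SERVICE2 = c("XRS07", NA, "XRS10"),
  SERVICE3 = c("XRS05", NA, NA),
  Position = c(12455, 4521, 57541),
  stringsAsFactors = FALSE
)

service_cols <- c("SERVICE1", "SERVICE2", "SERVICE3")

# Build one small data frame per service column, then stack them
long <- do.call(rbind, lapply(service_cols, function(col) {
  data.frame(
    Cod = original_data$Cod,
    SERVICE = original_data[[col]],
    Position = original_data$Position,
    stringsAsFactors = FALSE
  )
}))

# Keep only rows that actually carry a service
long <- long[!is.na(long$SERVICE) & long$SERVICE != "", ]

# Group the rows of each stop together and reset the row names
long <- long[order(long$Cod), ]
rownames(long) <- NULL
print(long)
```

With the sample data this leaves six rows: three for P05, one for R07, and two for X05.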

Related

How to replace empty spaces with values from adjacent column that needs to be separated?

Hi everyone. I need to extract the domain from some e-mail addresses
in a table. If an address ends in a country-code domain, that code
must be copied into another column, which is incomplete and which
records the country of each congress participant. The real database
is relatively large. I put an example below.
| email | country |
| -------- | -------------- |
| naco#gmail.com | CO |
| monic45814#gmail.com | AR |
| jsalazar#chapingo.mx | |
| andresramirez#urosario.edu.co | |
| jeimy861491#hotmail.com | CL |
| jytvc#hotmail.com | |
Outcome should be
| email | country |
| -------- | -------------- |
| naco#gmail.com | CO |
| monic45814#gmail.com | AR |
| jsalazar#chapingo.mx | MX |
| andresramirez#urosario.edu.co | CO |
| jeimy861491#hotmail.com | CL |
| jytvc#hotmail.com | *NA* |
Thank you so much.
You can use str_extract to get the string after the last occurrence of "." and if_else to skip rows that already have a country and rows whose e-mail does not end with a country code:
library(dplyr) # for mutate() and if_else()
library(stringr) # for str_extract()
df %>%
mutate(country = if_else(is.na(country) & str_extract(email, "[^.]+$") != "com", toupper(str_extract(email, "[^.]+$")), country))
A small but important PS: I would always recommend providing fake data when your example involves personal data such as e-mail addresses.
Here is a solution in base R.
Suppose:
df<-data.frame(email,country)
Then:
tld<-sub(".*\\.", "",df$email) # text after the last "."
df$country<-ifelse(is.na(df$country)&tld!="com",toupper(tld),as.character(df$country))
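Both answers assume that missing countries are stored as NA. If the blanks in the real table are empty strings, they need to be converted first. The following base R sketch shows the whole sequence on made-up addresses; the sample data, the variable names tld and fill, and the rule "anything other than com is a country code" are assumptions for illustration (a real list of non-country endings would also include org, net, and so on).

```r
# Made-up data; the second and fourth countries are blank strings, the third is NA
df <- data.frame(
  email = c("a@example.com", "b@example.mx", "c@example.edu.co", "d@example.com"),
  country = c("CO", "", NA, ""),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing values
df$country[!is.na(df$country) & df$country == ""] <- NA

# Text after the last "." in each address
tld <- sub(".*\\.", "", df$email)

# Fill only rows that lack a country and whose ending is not "com"
fill <- is.na(df$country) & tld != "com"
df$country[fill] <- toupper(tld[fill])
print(df)
```

Rows that already have a country are left untouched, and addresses ending in com keep NA.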

Control digits in specific cells

I have a table that looks like this:
+-----------------------------------+-------+--------+------+
| | Male | Female | n |
+-----------------------------------+-------+--------+------+
| way more than my fair share | 2,4 | 21,6 | 135 |
| a little more than my fair share | 5,4 | 38,1 | 244 |
| about my fair share | 54,0 | 35,3 | 491 |
| a littles less than my fair share | 25,1 | 3,0 | 153 |
| way less than my fair share | 8,7 | 0,7 | 51 |
| Can't say | 4,4 | 1,2 | 31 |
| n | 541,0 | 564,0 | 1105 |
+-----------------------------------+-------+--------+------+
Everything is fine, but I would like to show no decimal digits in the last row, since it holds the margins (actual case counts). Is there a way in R to control the number of digits in specific cells?
Thanks!
You could use ifelse to output the numbers in different formats in different rows, as in the example below. However, it will take some additional finagling to get the values in the last row to line up by place value with the previous rows:
library(knitr)
library(tidyverse)
# Fake data
set.seed(10)
dat = data.frame(category=c(LETTERS[1:6],"n"), replicate(3, rnorm(7, 100,20)))
dat %>%
mutate_if(is.numeric, funs(sprintf(ifelse(category=="n", "%1.0f", "%1.1f"), .))) %>%
kable(align="lrrr")
|category | X1| X2| X3|
|:--------|-----:|-----:|-----:|
|A | 100.4| 92.7| 114.8|
|B | 96.3| 67.5| 101.8|
|C | 72.6| 94.9| 80.9|
|D | 88.0| 122.0| 96.1|
|E | 105.9| 115.1| 118.5|
|F | 107.8| 95.2| 109.7|
|n | 76| 120| 88|
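The same row-dependent formatting can also be sketched without any packages. The block below is a base R illustration rather than the answer's own code: the small table tab and the names is_margin and num_cols are assumptions, with a few values borrowed from the question.

```r
# Small stand-in for the table in the question; the last row holds the margins
tab <- data.frame(
  category = c("way more", "a little more", "n"),
  Male = c(2.4, 5.4, 541),
  Female = c(21.6, 38.1, 564),
  stringsAsFactors = FALSE
)

# TRUE for the margin row, FALSE elsewhere
is_margin <- tab$category == "n"

# Format every numeric column: no decimals in the margin row, one decimal otherwise
num_cols <- vapply(tab, is.numeric, logical(1))
tab[num_cols] <- lapply(tab[num_cols], function(x) {
  sprintf(ifelse(is_margin, "%.0f", "%.1f"), x)
})
print(tab)
```

Note that the formatted columns are character vectors afterwards, so this step belongs at the very end, just before the table is printed.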
The huxtable package makes it easy to decimal-align the values (see the Vignette for more on table formatting):
library(huxtable)
tab = dat %>%
mutate_if(is.numeric, funs(sprintf(ifelse(category=="n", "%1.0f", "%1.1f"), .))) %>%
hux %>% add_colnames()
align(tab)[-1] = "."
tab
Here's what the PDF output looks like when knitted to PDF from an rmarkdown document:

How do I convert the Column level values into a single row in R? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 6 years ago.
I have a dataframe similar to below,
Name | ID | SET | COUNT |
------ | ------ |------ | ------ |
Value | 44000001005 | 0 | 24 |
Value | 10000000019659 | 0 | 29 |
Value | 10000000019659 | 1 | 5 |
The result that I need is something like,
Name | ID | 0 | 1 |
------ | ------ |------ | ------ |
Value | 44000001005 | 24 | 0 |
Value | 10000000019659 | 29 | 5 |
Can this be done or would I have to re-work the data set?
I am relatively new to R, so I may have missed some very obvious logic, but would appreciate if anyone could guide me.
Thank you.
If you want to change the format from a long to a wide format you can use the spread function from the tidyr package. There are other packages and possibilities, but this is my favorite.
If you are new to R, be aware that you have to install the package first with install.packages("tidyr").
Name <- c("Value","Value","Value")
ID <- c(6546465445,5464564,5464564)
SET <- c(0,0,1)
COUNT <- c(24,29,5)
df <- cbind.data.frame(Name,ID,SET,COUNT,stringsAsFactors=FALSE)
library(tidyr)
spread(data=df,key=SET,value = COUNT,fill=0) -> df_wide
See the documentation (?spread) for details about the function.
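If installing a package is not an option, base R's reshape() can produce a similar wide table. This is a sketch under a few assumptions: the IDs from the question are stored as character strings to keep them exact, the result columns are named COUNT.0 and COUNT.1 (the default naming of reshape()) rather than 0 and 1, and count_cols is a hypothetical helper name.

```r
# Data from the question; IDs kept as character (an assumption)
df <- data.frame(
  Name = c("Value", "Value", "Value"),
  ID = c("44000001005", "10000000019659", "10000000019659"),
  SET = c(0, 0, 1),
  COUNT = c(24, 29, 5),
  stringsAsFactors = FALSE
)

# One row per (Name, ID); one COUNT column per value of SET
df_wide <- reshape(df, idvar = c("Name", "ID"), timevar = "SET", direction = "wide")

# Combinations that never occurred come out as NA; replace them with 0
count_cols <- grep("^COUNT\\.", names(df_wide), value = TRUE)
for (col in count_cols) {
  df_wide[[col]][is.na(df_wide[[col]])] <- 0
}
print(df_wide)
```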

Combine DataFrame rows into a new column

I am wondering if there is a simple way to achieve this in Julia besides iterating over the rows in a for-loop.
I have a table with two columns that looks like this:
| Name | Interest |
|------|----------|
| AJ | Football |
| CJ | Running |
| AJ | Running |
| CC | Baseball |
| CC | Football |
| KD | Cricket |
...
I'd like to create a table where each Name in first column is matched with a combined Interest column as follows:
| Name | Interest |
|------|----------------------|
| AJ | Football, Running |
| CJ | Running |
| CC | Baseball, Football |
| KD | Cricket |
...
How do I achieve this?
UPDATE: OK, so after trying a few things including print_joint and grpby, I realized that the easiest way to do this would be by() function. I'm 99% there.
by(myTable, :Name, df->DataFrame(Interest = string(df[:Interest])))
This gives me my :Interest column as "UTF8String[\"Running\"]", and I can't figure out which method I should use instead of string() (or where to typecast) to get the desired ASCIIString output.

By group: sum of variable values under condition

I want to sum the values of a variable by group, excluding certain values depending on another variable.
How can this be done elegantly without transposing?
In the table below, for each (fTicker, DATE_f) I want the sum of wght over the group, excluding the wght that belongs to the current sTicker.
For example, excl_val for sTicker=A given (fTicker=XLK, DATE_f=6/20/2003) is wght_AAPL_6/20/2003_XLK + wght_AA_6/20/2003_XLK, without the wght for sTicker=A.
+---------+---------+-----------+-------------+-------------+
| sTicker | fTicker | DATE_f | wght | excl_val |
+---------+---------+-----------+-------------+-------------+
| A | XLK | 6/20/2003 | 0.087600002 | 1.980834016 |
| A | XLK | 6/23/2003 | 0.08585 | 1.898560068 |
| A | XLK | 6/24/2003 | 0.085500002 | |
| AAPL | XLK | 6/20/2003 | 0.070080002 | |
| AAPL | XLK | 6/23/2003 | 0.06868 | |
| AAPL | XLK | 6/24/2003 | 0.068400002 | |
| AA | XLK | 6/20/2003 | 1.910754014 | |
| AA | XLK | 6/23/2003 | 1.829880067 | |
| AA | XLK | 6/24/2003 | 1.819775 | |
| | | | | |
| | | | | |
+---------+---------+-----------+-------------+-------------+
There are several fTicker groups with many sTicker in them (10 to 70), some sTicker may belong to several fTicker. The end result should be an excl_val for each sTicker on each DATE_f and for each fTicker.
I did it by transposing in SAS, with a resulting file of about 6 GB, but the same approach in R blew memory up to 40 GB and is basically unworkable.
In R, I got as far as this
weights$excl_val <- with(weights, aggregate(wght, list(fTicker, DATE_f), sum, na.rm=T))
but it is just a simple sum (without excluding the necessary observation), and the row counts do not match. If I could condition the sum to exclude the wght of the current sTicker, I think it might work.
About the length of excl_val: I computed it in Excel for just 2 cells, which is why it is short.
Thank you!
Arsenio
When you have data in a data.frame, it is better if the rows are meaningful
(in particular, the columns should have the same length):
in this case, excl_val looks like a separate vector.
After putting the information it contains in the data.frame,
things become easier.
# Sample data
k <- 5
d <- data.frame(
sTicker = rep(LETTERS[1:k], k),
fTicker = rep(LETTERS[1:k], each=k),
DATE_f = sample( seq(Sys.Date(), length=2, by=1), k*k, replace=TRUE ),
wght = runif(k*k)
)
excl_val <- sample(d$wght, k)
# Add a "valid" column to the data.frame
d$valid <- ! d$wght %in% excl_val
# Compute the sum
library(plyr)
ddply(d, c("fTicker","DATE_f"), summarize, sum=sum(wght[valid]))
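The question can also be read differently from the answer above: for every row, take the sum of wght within its (fTicker, DATE_f) group and leave out that row's own wght. Under that reading, and assuming each sTicker appears at most once per group, no transposing is needed, because the value is simply the group total minus the row's own weight. The sketch below uses base R's ave(); the data are copied from the question's table, and group_total is a hypothetical helper name.

```r
# Data copied from the table in the question
weights <- data.frame(
  sTicker = rep(c("A", "AAPL", "AA"), each = 3),
  fTicker = "XLK",
  DATE_f = rep(c("6/20/2003", "6/23/2003", "6/24/2003"), times = 3),
  wght = c(0.087600002, 0.08585, 0.085500002,
           0.070080002, 0.06868, 0.068400002,
           1.910754014, 1.829880067, 1.819775),
  stringsAsFactors = FALSE
)

# Total weight of each (fTicker, DATE_f) group, repeated on every row of the group
group_total <- ave(weights$wght, weights$fTicker, weights$DATE_f, FUN = sum)

# Leave-one-out sum: group total minus the row's own weight
weights$excl_val <- group_total - weights$wght
print(weights)
```

For sTicker A on 6/20/2003 this gives 0.070080002 + 1.910754014 = 1.980834016, which matches the value the asker computed by hand. Because ave() returns a vector as long as the input, the result fits into the data frame without the row-count mismatch that aggregate() produced.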
