I have some data that look like this:
head(t)
sub trialnum block.x lat.x block.y lat.y diff
1 1 10 3 1355 5 1337 18
2 1 11 3 1324 5 1470 -146
3 1 12 3 1861 5 1690 171
4 1 13 3 3501 5 1473 2028
5 1 14 3 1566 5 1402 164
6 1 15 3 1380 5 1539 -159
What I would like to do is reformat the data in R such that the values of "trialnum" (there are 20 of them) are the new columns, "sub" is the row values, and each cell has the "diff" value. For example
trialnum1 trialnum2 trialnum3...
sub
1
2
3
.
.
.
Any help would be much appreciated. Although the answer is probably simple, I've been struggling with this problem for some time.
Base package. We transpose column diff with the function t(x), then create the desired column names.
df <- data.frame(t(t[, 7]))
# Using the trialnum column
colnames(df) <- paste0(colnames(t[2]), t[, 2])
# or just the number of rows
colnames(df) <- paste0(colnames(t[2]), 1:nrow(t))
Output:
trialnum10 trialnum11 trialnum12 trialnum13 trialnum14 trialnum15
1 18 -146 171 2028 164 -159
trialnum1 trialnum2 trialnum3 trialnum4 trialnum5 trialnum6
1 18 -146 171 2028 164 -159
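When the data contain more than one subject, a one-row transpose is no longer enough. As a minimal sketch in base R, reshape() can spread diff across trialnum with one row per sub. The data frame name dat is an assumption (chosen to avoid masking base::t()), and only the three columns needed for the reshape are reproduced from the question's sample:

```r
# Sample values from the question; `dat` is an assumed name, not the asker's
dat <- data.frame(
  sub      = rep(1L, 6),
  trialnum = 10:15,
  diff     = c(18L, -146L, 171L, 2028L, 164L, -159L)
)

# One row per `sub`, one column per `trialnum`, cells filled with `diff`
wide <- reshape(dat, idvar = "sub", timevar = "trialnum", direction = "wide")

# reshape() names the new columns diff.10, diff.11, ...; rename to trialnum10, ...
names(wide) <- sub("^diff\\.", "trialnum", names(wide))
wide
```

With several subjects in dat, the same call would return one row for each distinct value of sub.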
With dplyr and tidyr, first get rid of the columns you don't want, then spread trialnum and diff.
library(dplyr)
library(tidyr)
t %>% select(-block.x:-lat.y) %>% # get rid of extra columns so t will collapse
mutate(trialnum = paste0('trialnum', trialnum)) %>% # fix values for column names
spread(trialnum, diff) # spread columns
# sub trialnum10 trialnum11 trialnum12 trialnum13 trialnum14 trialnum15
# 1 1 18 -146 171 2028 164 -159
Data
t <- structure(list(sub = c(1L, 1L, 1L, 1L, 1L, 1L), trialnum = 10:15,
block.x = c(3L, 3L, 3L, 3L, 3L, 3L), lat.x = c(1355L, 1324L,
1861L, 3501L, 1566L, 1380L), block.y = c(5L, 5L, 5L, 5L,
5L, 5L), lat.y = c(1337L, 1470L, 1690L, 1473L, 1402L, 1539L
), diff = c(18L, -146L, 171L, 2028L, 164L, -159L)), .Names = c("sub",
"trialnum", "block.x", "lat.x", "block.y", "lat.y", "diff"), row.names = c(NA,
-6L), class = "data.frame")
Given a time series of cinema data, the "date" column is of interest. I would like to convert it into the format "YYYY/MM/DD". However, when I run my code:
CINEMA.TICKET$DATE <- as.Date(CINEMA.TICKET$date , format = "%y/%m/%d")
Two issues occur:
First, the dates are shown on the far right of the table as, e.g., "0005-05-20". Second, many entries disappear entirely. Can someone explain what I am doing wrong and how I can do it properly?
film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day newdate DATE
1 1492 304 3900000 26 0 4 4.26 150000 26 610.3286 5/5/2018 5 2 5 0005-05-20 2005-05-20
2 1492 352 3360000 42 0 5 8.08 80000 42 519.8020 5/5/2018 5 2 5 0005-05-20 2005-05-20
3 1492 489 2560000 32 0 4 20.00 80000 32 160.0000 5/5/2018 5 2 5 0005-05-20 2005-05-20
4 1492 429 1200000 12 0 1 11.01 100000 12 108.9918 5/5/2018 5 2 5 0005-05-20 2005-05-20
5 1492 524 1200000 15 0 3 16.67 80000 15 89.9820 5/5/2018 5 2 5 0005-05-20 2005-05-20
6 1492 71 1050000 7 0 3 0.98 150000 7 714.2857 5/5/2018 5 2 5 0005-05-20 2005-05-20
> str(CINEMA.TICKET)
As @Dave2e pointed out, you are looking for the following (data.table syntax):
CINEMA.TICKET[, date := as.Date(date, format = "%d/%m/%Y")]
This assumes the input format is day/month/year, as in "30/5/2018". The question's only example, "5/5/2018", is ambiguous: it could be "%d/%m/%Y" or "%m/%d/%Y".
As for ordering columns use:
setcolorder(CINEMA.TICKET, c("c", "b", "a"))
where c,b,a are column names in their desired order
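To see why the original format string "%y/%m/%d" produced dates like 2005-05-20 and missing values, here is a small base R sketch. The value "5/13/2018" is hypothetical and only illustrates a month/day/year entry whose day is greater than 12; the month/day/year reading itself is an assumption about the data.

```r
x <- "5/5/2018"

# %y consumes "5" as a two-digit year (2005), %m consumes "5",
# and %d consumes at most two digits, "20"; the trailing "18" is ignored.
wrong <- as.Date(x, format = "%y/%m/%d")
format(wrong)                                     # "2005-05-20"

# A second field above 12 cannot be parsed as a month, so the result is NA.
# "5/13/2018" is a hypothetical value, not taken from the question's data.
is.na(as.Date("5/13/2018", format = "%y/%m/%d"))  # TRUE

# Parsing with a format that matches the input (assumed month/day/year)
right <- as.Date(x, format = "%m/%d/%Y")
format(right)                                     # "2018-05-05"
format(right, "%Y/%m/%d")                         # "2018/05/05", a character string
```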
lubridate probably does the trick
> lubridate::mdy("5/5/2018")
[1] "2018-05-05"
So you should use
library(lubridate)
library(tidyverse)
CINEMA.TICKET <- CINEMA.TICKET %>%
mutate(DATE=mdy(date))
Here is another option:
library(tidyverse)
output <- df %>%
mutate(date = as.Date(date, format="%m/%d/%Y"))
Output
film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day
1 1492 304 3900000 26 0 4 4.26 150000 26 610.3286 2018-05-05 5 2 5
2 1492 352 3360000 42 0 5 8.08 80000 42 519.8020 2018-05-05 5 2 5
3 1492 489 2560000 32 0 4 20.00 80000 32 160.0000 2018-05-05 5 2 5
4 1492 429 1200000 12 0 1 11.01 100000 12 108.9918 2018-05-05 5 2 5
5 1492 524 1200000 15 0 3 16.67 80000 15 89.9820 2018-05-05 5 2 5
6 1492 71 1050000 7 0 3 0.98 150000 7 714.2857 2018-05-05 5 2 5
A Date object always prints in ISO format (YYYY-MM-DD), so it cannot display forward slashes and still be of class Date. You can reformat it with format(), but the result is a character vector again, not a Date.
class(output$date)
# [1] "Date"
output2 <- df %>%
mutate(date = as.Date(date, format="%m/%d/%Y")) %>%
mutate(date = format(date, "%Y/%m/%d"))
class(output2$date)
# [1] "character"
Data
df <-
structure(
list(
film_code = c(1492L, 1492L, 1492L, 1492L, 1492L,
1492L),
cinema_code = c(304L, 352L, 489L, 429L, 524L, 71L),
total_sales = c(3900000L,
3360000L, 2560000L, 1200000L, 1200000L, 1050000L),
tickets_sold = c(26L,
42L, 32L, 12L, 15L, 7L),
tickets_out = c(0L, 0L, 0L, 0L, 0L,
0L),
show_time = c(4L, 5L, 4L, 1L, 3L, 3L),
occu_perc = c(4.26,
8.08, 20, 11.01, 16.67, 0.98),
ticket_price = c(150000L, 80000L,
80000L, 100000L, 80000L, 150000L),
ticket_use = c(26L, 42L, 32L,
12L, 15L, 7L),
capacity = c(610.3286, 519.802, 160, 108.9918,
89.982, 714.2857),
date = c("5/5/2018", "5/5/2018", "5/5/2018", "5/5/2018",
"5/5/2018", "5/5/2018"),
month = c(5L, 5L, 5L, 5L, 5L, 5L),
quarter = c(2L,
2L, 2L, 2L, 2L, 2L),
day = c(5L, 5L, 5L, 5L, 5L, 5L)
),
class = "data.frame",
row.names = c(NA,-6L)
)
I have data that looks like this, but much bigger:
df<- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
As an example, I am trying to remove -1 from all strings in the first column.
I can do this with
as.data.frame(str_remove_all(df$names, "-1"))
The problem is that this drops all the other columns as well.
I don't want to split the data and merge it again because I am afraid of creating a mismatch.
Is there any way to just get rid of specific strings without disturbing the rest of the data?
For instance, the output should look like this:
names Col col2
bests 1 2
trible 2 4
crazy NA 5
cool 4 7
nonsense 47 9
Mean 294 9
Lose 2 0
Trye 1 2
Trified 3 3
Using gsub, with $ anchoring the match to the end of the string (the hyphen is escaped as \\-, which is harmless although not strictly required here):
transform(df, names=gsub('\\-1$', '', names))
# names Col col2
# 1 bests 1 2
# 2 trible 2 4
# 3 crazy NA 5
# 4 cool 4 7
# 5 nonsense 47 9
# 6 Mean 294 9
# 7 Lose 2 0
# 8 Trye 1 2
# 9 Trified 3 3
Data:
df <- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
Using the stringr package:
library(stringr)
df$names <- str_remove_all(df$names, '-1')
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3
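The anchored and unanchored patterns above give the same result on the sample data, but they can differ on other inputs. A small base R sketch of the difference follows; the value "item-10-1" is hypothetical and not part of the question's data.

```r
# "item-10-1" is a hypothetical value added only to show the difference
x <- c("bests-1", "item-10-1")

# Unanchored: every "-1" is removed, including the one inside "-10"
all_removed <- gsub("-1", "", x)
all_removed      # "bests" "item0"

# Anchored with $: only a trailing "-1" is removed
trailing_only <- sub("-1$", "", x)
trailing_only    # "bests" "item-10"
```

If only the suffix should go, the anchored form is the safer choice.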
We could use trimws from base R
df$names <- trimws(df$names, whitespace = "-\\d+")
Output:
> df
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3
Suppose I have two datasets. One main dataset, with many columns of metadata, and one new dataset which will be used to fill in some of the gaps in concentrations in the main dataset:
Main dataset:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 NA NA
1 4 22 0 NA NA
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 NA NA
2 4 37 3 NA NA
New data set to merge:
study_id timepoint concentration1 concentration2
1 3 11 20
1 4 21 35
2 3 7 17
2 4 14 25
Whenever I merge by "study_id" and "timepoint", I get two new columns that are "concentration1.y" and "concentration2.y" while the original columns get renamed as "concentration1.x" and "concentration2.x". I don't want this.
This is what I want:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 11 20
1 4 22 0 21 35
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 7 17
2 4 37 3 14 25
In other words, I want to merge by "study_id" and "timepoint" AND merge the two concentration columns so the data are within the same columns. Please note that both datasets do not have identical columns (dataset 1 has 1000 columns with metadata while dataset2 just has study id, timepoint, and concentration columns that match the concentration columns in dataset1).
Thanks so much in advance.
Using coalesce() (from the dplyr package) is one option. The join still adds the .x and .y concentration columns from the two data frames; these are removed once the NA values have been filled in.
library(tidyverse)
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
mutate(concentration1 = coalesce(concentration1.x, concentration1.y),
concentration2 = coalesce(concentration2.x, concentration2.y)) %>%
select(-concentration1.x, -concentration1.y, -concentration2.x, -concentration2.y)
Or to generalize with multiple concentration columns:
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.x|\\.y")) %>%
map_df(reduce, coalesce)
Edit: To prevent the resultant column names from being alphabetized from split.default, you can add an intermediate step of sorting the list based on the first data frame's column name order.
df3 <- df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.x|\\.y"))
df3[names(df1)] %>%
map_df(reduce, coalesce)
Output
study_id timepoint age occupation concentration1 concentration2
1 1 1 21 0 3 7
2 1 2 21 0 4 6
3 1 3 22 0 11 20
4 1 4 22 0 21 35
5 2 1 36 3 0 4
6 2 2 36 3 2 11
7 2 3 37 3 7 17
8 2 4 37 3 14 25
Data
df1 <- structure(list(study_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
timepoint = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), age = c(21L,
21L, 22L, 22L, 36L, 36L, 37L, 37L), occupation = c(0L, 0L,
0L, 0L, 3L, 3L, 3L, 3L), concentration1 = c(3L, 4L, NA, NA,
0L, 2L, NA, NA), concentration2 = c(7L, 6L, NA, NA, 4L, 11L,
NA, NA)), class = "data.frame", row.names = c(NA, -8L))
df2 <- structure(list(study_id = c(1L, 1L, 2L, 2L), timepoint = c(3L,
4L, 3L, 4L), concentration1 = c(11L, 21L, 7L, 14L), concentration2 = c(20L,
35L, 17L, 25L)), class = "data.frame", row.names = c(NA, -4L))
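As an alternative that never creates .x and .y columns, here is a minimal base R sketch that looks up each row of df1 in df2 by key and fills only the missing concentrations. The helper names key1, key2, idx and filled are assumptions introduced for illustration, and the sketch assumes each (study_id, timepoint) pair appears at most once in df2.

```r
df1 <- data.frame(
  study_id       = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  timepoint      = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  age            = c(21L, 21L, 22L, 22L, 36L, 36L, 37L, 37L),
  occupation     = c(0L, 0L, 0L, 0L, 3L, 3L, 3L, 3L),
  concentration1 = c(3L, 4L, NA, NA, 0L, 2L, NA, NA),
  concentration2 = c(7L, 6L, NA, NA, 4L, 11L, NA, NA)
)
df2 <- data.frame(
  study_id       = c(1L, 1L, 2L, 2L),
  timepoint      = c(3L, 4L, 3L, 4L),
  concentration1 = c(11L, 21L, 7L, 14L),
  concentration2 = c(20L, 35L, 17L, 25L)
)

# Build a composite key and find, for every row of df1, its row in df2 (or NA)
key1 <- paste(df1$study_id, df1$timepoint, sep = "_")
key2 <- paste(df2$study_id, df2$timepoint, sep = "_")
idx  <- match(key1, key2)

# Fill only cells that are NA in df1 and have a matching row in df2
filled <- df1
for (col in setdiff(names(df2), c("study_id", "timepoint"))) {
  fill <- is.na(filled[[col]]) & !is.na(idx)
  filled[[col]][fill] <- df2[[col]][idx[fill]]
}
filled
```

Because the loop runs over the non-key columns of df2, it does not need to know how many other metadata columns df1 has.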
NUMBER WEIGHT DAILY-LANG RELIGION PROVINCE DISTRICT SUB_DISTRI
5 9.50 1167 1 11 01 010
6 9.50 1167 1 11 01 010
7 9.50 1167 1 11 01 010
8 10.30 4 2 33 071 220
9 10.10 6 1 61 8 170
This is the data I have. I need to find the number of DAILY-LANG speakers in each SUB_DISTRI.
If the columns WEIGHT, DAILY-LANG, RELIGION, PROVINCE, DISTRICT and SUB_DISTRI are unique for a speaker, you can use nrow and unique to get the number of speakers.
nrow(unique(x))
#[1] 3
To get DAILY-LANG per RELIGION, PROVINCE, DISTRICT and SUB_DISTRI you can use unique, split and interaction:
y <- unique(x)
split(y$DAILY.LANG,
interaction(y[c("RELIGION", "PROVINCE", "DISTRICT", "SUB_DISTRI")], drop=TRUE))
#$`1.11.1.10`
#[1] 1167
#
#$`1.61.8.170`
#[1] 6
#
#$`2.33.71.220`
#[1] 4
Or if SUB_DISTRI is already unique:
split(y$DAILY.LANG, y$SUB_DISTRI)
#$`10`
#[1] 1167
#
#$`170`
#[1] 6
#
#$`220`
#[1] 4
Data:
x <- structure(list(WEIGHT = c(9.5, 9.5, 9.5, 10.3, 10.1), DAILY.LANG = c(1167L,
1167L, 1167L, 4L, 6L), RELIGION = c(1L, 1L, 1L, 2L, 1L), PROVINCE = c(11L,
11L, 11L, 33L, 61L), DISTRICT = c(1L, 1L, 1L, 71L, 8L), SUB_DISTRI = c(10L,
10L, 10L, 220L, 170L)), row.names = c(NA, -5L), class = "data.frame")
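If instead every row is one respondent (a different reading of the question, so treat this as an assumption), the number of rows per sub-district and language can be counted with aggregate(). The helper column n and the result name counts are introduced here for illustration only.

```r
# Columns needed for the count, with the values from the data above
x <- data.frame(
  DAILY.LANG = c(1167L, 1167L, 1167L, 4L, 6L),
  SUB_DISTRI = c(10L, 10L, 10L, 220L, 170L)
)

# Add a helper column of ones and sum it within each group
counts <- aggregate(n ~ SUB_DISTRI + DAILY.LANG,
                    data = transform(x, n = 1L),
                    FUN = sum)
counts
```

Each row of counts is one combination of SUB_DISTRI and DAILY.LANG that occurs in the data, with n giving the number of rows in that group.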
I have a dataframe that looks like this:
ID Team
11 1
22 2
45 4
45 2
79 3
79 4
100 2
123 1
167 3
167 1
I have to subset only those rows which ARE duplicated until the end of the data frame is reached. How can it be done?
If you meant to subset the rows that have duplicated IDs:
dat <- structure(list(ID = c(11L, 22L, 45L, 45L, 79L, 79L, 100L, 123L,
167L, 167L), Team = c(1L, 2L, 4L, 2L, 3L, 4L, 2L, 1L, 3L, 1L)), .Names = c("ID",
"Team"), class = "data.frame", row.names = c(NA, -10L))
dat[duplicated(dat$ID) | duplicated(dat$ID, fromLast = TRUE), ]
# ID Team
# 3 45 4
# 4 45 2
# 5 79 3
# 6 79 4
# 9 167 3
# 10 167 1
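An equivalent way to express the same subset, sketched here under the same reading of the question, is to collect the IDs that occur more than once and keep every row whose ID is in that set. The names dup_ids and dups are assumptions introduced for illustration.

```r
dat <- data.frame(
  ID   = c(11L, 22L, 45L, 45L, 79L, 79L, 100L, 123L, 167L, 167L),
  Team = c(1L, 2L, 4L, 2L, 3L, 4L, 2L, 1L, 3L, 1L)
)

# IDs that appear at least twice
dup_ids <- unique(dat$ID[duplicated(dat$ID)])

# Keep all rows belonging to those IDs (first occurrences included)
dups <- dat[dat$ID %in% dup_ids, ]
dups
```

This form reads slightly more directly when the set of duplicated IDs is also needed elsewhere.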