I have a dataset like this:
name state num1 num2 num3
abc rt 10 40 8
def ka 20 50 15
ert pn 30 60 16
I want the row sum for each row. When I use rowSums(data), it throws an error saying that 'x' must be numeric. The new column should be the total of num1, num2 and num3.
Here are some of the suggested solutions. However, first, as always, create some data,
dta <- structure(list(name = structure(1:3, .Label = c("abc", "def",
"ert"), class = "factor"), state = structure(c(3L, 1L, 2L), .Label = c("ka",
"pn", "rt"), class = "factor"), num1 = c(10L, 20L, 30L), num2 = c(40L,
50L, 60L), num3 = c(8L, 15L, 16L)), .Names = c("name", "state",
"num1", "num2", "num3"), class = "data.frame", row.names = c(NA,
-3L))
Second, almost always, show the data,
dta
#> name state num1 num2 num3
#> 1 abc rt 10 40 8
#> 2 def ka 20 50 15
#> 3 ert pn 30 60 16
maybe also use str(), as it's relevant for understanding the specific problem here,
str(dta)
#> 'data.frame': 3 obs. of 5 variables:
#> $ name : Factor w/ 3 levels "abc","def","ert": 1 2 3
#> $ state: Factor w/ 3 levels "ka","pn","rt": 3 1 2
#> $ num1 : int 10 20 30
#> $ num2 : int 40 50 60
#> $ num3 : int 8 15 16
The problem originates in the fact that the data is a mix of factors and integers; obviously, we cannot sum factors.
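The error is easy to reproduce. A minimal sketch, assuming a copy of the same data rebuilt with data.frame() (so name and state are character rather than factor columns) and a hypothetical variable res that captures the error instead of stopping the script:

```r
# Same values as dta above, rebuilt here so the snippet is self-contained
dta <- data.frame(
  name  = c("abc", "def", "ert"),
  state = c("rt", "ka", "pn"),
  num1  = c(10L, 20L, 30L),
  num2  = c(40L, 50L, 60L),
  num3  = c(8L, 15L, 16L)
)

# rowSums() on the whole data frame fails because of the non-numeric columns;
# tryCatch() returns the error object instead of aborting
res <- tryCatch(rowSums(dta), error = function(e) e)
inherits(res, "error")
conditionMessage(res)
```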
Now to some solutions.
First, akrun's first solution,
rowSums(dta[grep("num\\d+", names(dta))])
#> [1] 58 85 106
Second, Renu's solution,
rowSums(dta[,sapply(dta, is.numeric)])
#> [1] 58 85 106
Third, a slightly reworded version of akrun's second solution,
# install.packages(c("tidyverse"), dependencies = TRUE)
library(tidyverse)
dta %>% select(matches("num\\d+")) %>% mutate(rowsum = rowSums(.))
#> num1 num2 num3 rowsum
#> 1 10 40 8 58
#> 2 20 50 15 85
#> 3 30 60 16 106
Finally, this nice plyr option (note that it returns the sum of each numeric column rather than of each row),
# install.packages(c("plyr"), dependencies = TRUE)
plyr::numcolwise(sum)(dta)
#> num1 num2 num3
#> 1 60 150 39
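The question asks for the total as a new column next to name and state. A minimal base R sketch, assuming the same data rebuilt with data.frame() and a hypothetical column name total:

```r
# Same values as dta above, rebuilt here so the snippet is self-contained
dta <- data.frame(
  name  = c("abc", "def", "ert"),
  state = c("rt", "ka", "pn"),
  num1  = c(10L, 20L, 30L),
  num2  = c(40L, 50L, 60L),
  num3  = c(8L, 15L, 16L)
)

# Flag the numeric columns, then sum only those for each row
num_cols <- vapply(dta, is.numeric, logical(1))
dta$total <- rowSums(dta[num_cols])  # 'total' is a made-up column name
dta
```

Computing num_cols before adding total keeps the new column out of its own sum if the snippet is rerun from the top.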
Finally, here is an almost identical question. Now they are at least linked.
I'm trying to calculate percent change in R, with each of the time points included in the column label (table below). I have dplyr loaded, and my dataset is loaded in R under the name data. Below is the code I'm using, but it's not calculating correctly. I want to create a new data frame called data_per_chg that contains the percent change of each variable from its "v1" value. For instance, for the wbc variable, I would like to calculate the percent change of wbc.v1 from wbc.v1, wbc.v2 from wbc.v1, wbc.v3 from wbc.v1, etc., and do that for all the remaining variables in my dataset. I assume I can use a loop to do this, but I'm pretty new to R, so I'm not quite sure how to proceed. Any guidance will be greatly appreciated.
id wbc.v1 wbc.v2 wbc.v3 rbc.v1 rbc.v2 rbc.v3 hct.v1 hct.v2 hct.v3
a1 23     63     30     23     56     90     13     89     47
a2 81     45     46     N/A    18     78     14     45     22
a3 NA     27     14     29     67     46     37     34     33
data_per_chg <- data %>%
  group_by(id) %>%
  arrange(id) %>%
  mutate(change = (wbc.v2 - wbc.v1) / wbc.v1)
data_per_chg
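Assuming all missing values should be treated as NA (including the "N/A" string), convert them first and then compute the change relative to the matching 'v1' column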
library(dplyr)
library(stringr)
data <- data %>%
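# applied per column, since na_if() on a whole data frame errors in dplyr >= 1.1.0
mutate(across(where(is.character), ~ na_if(.x, "N/A"))) %>%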
type.convert(as.is = TRUE) %>%
mutate(across(-c(id, matches("\\.v1$")), ~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change"))
-output
data
id wbc.v1 wbc.v2 wbc.v3 rbc.v1 rbc.v2 rbc.v3 hct.v1 hct.v2 hct.v3 wbc.v2_change wbc.v3_change rbc.v2_change rbc.v3_change hct.v2_change hct.v3_change
1 a1 23 63 30 23 56 90 13 89 47 1.7391304 0.3043478 1.434783 2.9130435 5.84615385 2.6153846
2 a2 81 45 46 NA 18 78 14 45 22 -0.4444444 -0.4320988 NA NA 2.21428571 0.5714286
3 a3 NA 27 14 29 67 46 37 34 33 NA NA 1.310345 0.5862069 -0.08108108 -0.1081081
If we want to keep the 'v1' columns as well
data %>%
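# applied per column, since na_if() on a whole data frame errors in dplyr >= 1.1.0
mutate(across(where(is.character), ~ na_if(.x, "N/A"))) %>%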
type.convert(as.is = TRUE) %>%
mutate(across(ends_with('.v1'), ~ .x - .x,
.names = "{str_replace(.col, 'v1', 'v1change')}")) %>%
transmute(id, across(ends_with('change')),
across(-c(id, matches("\\.v1$"), ends_with('change')),
~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change")) %>%
select(id, starts_with('wbc'), starts_with('rbc'), starts_with('hct'))
-output
id wbc.v1change wbc.v2_change wbc.v3_change rbc.v1change rbc.v2_change rbc.v3_change hct.v1change hct.v2_change hct.v3_change
1 a1 0 1.7391304 0.3043478 0 1.434783 2.9130435 0 5.84615385 2.6153846
2 a2 0 -0.4444444 -0.4320988 NA NA NA 0 2.21428571 0.5714286
3 a3 NA NA NA 0 1.310345 0.5862069 0 -0.08108108 -0.1081081
data
data <- structure(list(id = c("a1", "a2", "a3"), wbc.v1 = c(23L, 81L,
NA), wbc.v2 = c(63L, 45L, 27L), wbc.v3 = c(30L, 46L, 14L), rbc.v1 = c("23",
"N/A", "29"), rbc.v2 = c(56L, 18L, 67L), rbc.v3 = c(90L, 78L,
46L), hct.v1 = c(13L, 14L, 37L), hct.v2 = c(89L, 45L, 34L), hct.v3 = c(47L,
22L, 33L)), class = "data.frame", row.names = c(NA, -3L))
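Since the question mentions a loop, the same calculation can also be sketched in base R. This is only an illustration under a few assumptions: the column names follow the prefix.vN pattern, the change is reported as a fraction (as in the output above) rather than multiplied by 100, and the names prefixes, baseline and visits are made up for this example.

```r
# Same values as the data above, with "N/A" still present in rbc.v1
data <- data.frame(
  id = c("a1", "a2", "a3"),
  wbc.v1 = c(23, 81, NA), wbc.v2 = c(63, 45, 27), wbc.v3 = c(30, 46, 14),
  rbc.v1 = c("23", "N/A", "29"), rbc.v2 = c(56, 18, 67), rbc.v3 = c(90, 78, 46),
  hct.v1 = c(13, 14, 37), hct.v2 = c(89, 45, 34), hct.v3 = c(47, 22, 33)
)

# Turn the "N/A" string into a real NA and make the column numeric
data$rbc.v1 <- as.numeric(replace(data$rbc.v1, data$rbc.v1 == "N/A", NA))

# Variable prefixes: "wbc", "rbc", "hct"
prefixes <- unique(sub("\\.v\\d+$", "", names(data)[-1]))

data_per_chg <- data["id"]
for (p in prefixes) {
  baseline <- data[[paste0(p, ".v1")]]
  visits <- grep(paste0("^", p, "\\.v\\d+$"), names(data), value = TRUE)
  for (col in visits) {
    # Change relative to the v1 column of the same variable
    data_per_chg[[paste0(col, "_change")]] <- (data[[col]] - baseline) / baseline
  }
}
data_per_chg
```

The v1 columns are included in the loop, so each variable also gets a v1 change of 0 (or NA where the baseline is missing).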
I have a fairly straightforward need, but I can't find a previously asked question that is similar enough. I've been trying with dplyr, but can't figure it out.
julian year
088 22
049 19
041 22
105 18
125 22
245 20
What I want is for each value where data$julian < 105, subtract '1' from data$year, so that
julian year
088 21
049 18
041 21
105 18
125 22
245 20
The OP asked about using dplyr in the post. Here is one option with dplyr:
library(dplyr)
df1 <- df1 %>%
mutate(year = case_when(as.numeric(julian) < 105 ~ year -1,
TRUE ~ as.numeric(year)))
-output
df1
julian year
1 088 21
2 049 18
3 041 21
4 105 18
5 125 22
6 245 20
data
df1 <- structure(list(julian = c("088", "049", "041", "105", "125",
"245"), year = c(22L, 19L, 22L, 18L, 22L, 20L)), row.names = c(NA,
-6L), class = "data.frame")
Another option with base R:
idx <- as.numeric(df$julian) < 105  # julian is stored as text, so convert before comparing
df$year[idx] <- df$year[idx] - 1
Output
julian year
1 088 21
2 049 18
3 041 21
4 105 18
5 125 22
6 245 20
Data
df <- structure(list(julian = c("088", "049", "041", "105", "125",
"245"), year = c(22L, 19L, 22L, 18L, 22L, 20L)), row.names = c(NA,
-6L), class = "data.frame")
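One detail worth spelling out: julian is stored as zero-padded text, so comparing it directly with 105 is a string comparison that only happens to give the right answer for three-character values. A minimal sketch that converts explicitly and keeps year as an integer, assuming the julian/year data from the question and a hypothetical helper vector early:

```r
# julian/year data from the question, with julian kept as zero-padded text
df <- data.frame(
  julian = c("088", "049", "041", "105", "125", "245"),
  year   = c(22L, 19L, 22L, 18L, 22L, 20L)
)

# TRUE where the day of year is before 105; converted to 1/0 for the subtraction
early <- as.integer(df$julian) < 105
df$year <- df$year - as.integer(early)
df
```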
For an easier explanation, I'm going to use a smaller example.
I have two DF:
DF1: T01 T02 T03 T04 T05
1 15 20 48 25 5
2 12 18 35 30 12
3 13 15 50 60 42
DF2: MEDIAN SD
T01 13 1.24
T02 18 2.05
T03 45 6.64
T04 30 15.45
T05 12 16.04
What I want to do is create a loop that adds a dummy to DF1 for each variable, taking the value 1 if DF1$T01 is approximately equal to DF2$MEDIAN[1] and 0 if it is not, and then moves on to T02, T03, and so on until the last variable.
So far, I haven't been able to write a loop that does this (I'm not very good at writing loops). I did manage to make the dummy for one of the variables (T01), but the real DF has over 40 variables, so doing it by hand is not efficient at all. What I have right now is:
DF1$dummyt01 <- ifelse(almost.equal(DF1$T01, DF2$MEDIAN[1], tolerance = 2),1,0)
outcome expected:
DF1: T01 T02 T03 T04 T05 dummyT01 dummyT02 ... dummyT05
1 15 20 48 25 5 1 1 ... 0
2 12 18 35 30 12 1 1 ... 1
3 13 15 50 60 42 1 0 ... 0
Note: Not a native English speaker. Sorry for any mistakes.
EDIT: Expected Outcome.
We may use tidyverse. Loop across the columns of 'DF1', get the name of the current column (cur_column()), and use it as a row name to look up the matching 'MEDIAN' element in 'DF2'. The comparison with almost.equal returns a logical vector, which is coerced to binary with as.integer or +. In .names, add the prefix 'dummy' so that the results are created as new columns.
library(dplyr)
library(berryFunctions)
DF1 <- DF1 %>%
mutate(across(everything(), ~ +(almost.equal(.,
DF2[cur_column(), "MEDIAN"], tolerance = 1)),
.names = "dummy{.col}"))
-output
DF1
T01 T02 T03 T04 T05 dummyT01 dummyT02 dummyT03 dummyT04 dummyT05
1 15 20 48 25 5 0 0 0 0 0
2 12 18 35 30 12 1 1 0 1 1
3 13 15 50 60 42 1 0 0 0 0
Or using a for loop
for(i in seq_along(DF1))
DF1[paste0('dummy', names(DF1)[i])] <- +(almost.equal(DF1[[i]],
DF2[names(DF1)[i], "MEDIAN"], tolerance = 1))
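The same idea can be sketched without berryFunctions, under the assumption that "almost equal" means an absolute difference of at most a chosen tolerance. That is an interpretation, not necessarily how almost.equal defines it. The names tol, med and dummies are made up here, and tol = 2 is taken from the question's own attempt.

```r
DF1 <- data.frame(
  T01 = c(15L, 12L, 13L), T02 = c(20L, 18L, 15L), T03 = c(48L, 35L, 50L),
  T04 = c(25L, 30L, 60L), T05 = c(5L, 12L, 42L)
)
DF2 <- data.frame(
  MEDIAN = c(13, 18, 45, 30, 12),
  SD = c(1.24, 2.05, 6.64, 15.45, 16.04),
  row.names = c("T01", "T02", "T03", "T04", "T05")
)

tol <- 2                           # assumed meaning: |value - median| <= tol
med <- DF2[names(DF1), "MEDIAN"]   # medians lined up with DF1's columns by name

# Subtract each column's median, then flag the values within the tolerance
dummies <- +(abs(sweep(as.matrix(DF1), 2, med)) <= tol)
colnames(dummies) <- paste0("dummy", names(DF1))

DF1 <- cbind(DF1, dummies)
DF1
```

Looking the medians up by name rather than by position keeps the result correct even if the rows of DF2 are in a different order from the columns of DF1.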
data
DF1 <- structure(list(T01 = c(15L, 12L, 13L), T02 = c(20L, 18L, 15L),
T03 = c(48L, 35L, 50L), T04 = c(25L, 30L, 60L), T05 = c(5L,
12L, 42L)), class = "data.frame", row.names = c("1", "2",
"3"))
DF2 <- structure(list(MEDIAN = c(13L, 18L, 45L, 30L, 12L), SD = c(1.24,
2.05, 6.64, 15.45, 16.04)), class = "data.frame", row.names = c("T01",
"T02", "T03", "T04", "T05"))
I need to aggregate data in R. I have 8 columns, 3 of which are categorical and 5 of which are numeric and need to be summed conditionally based off of a combination of conditions from 2 of the categorical variables. My data looks like the below:
df <- structure(list(Color = c("Red", "Blue", "Blue", "Red", "Yellow"
), Weekend = c(1L, 0L, 1L, 0L, 1L), LeapYear = c(1L, 1L, 0L,
0L, 0L), Length = c(15L, 20L, 10L, 15L, 15L), Height = c(50L,
70L, 35L, 28L, 80L), Weight = c(120L, 130L, 120L, 105L, 140L),
Cost = c(25L, 50L, 55L, 65L, 80L), Purchases = c(5L, 10L,
5L, 10L, 15L)), class = "data.frame", row.names = c(NA, -5L
))
> df
Color Weekend LeapYear Length Height Weight Cost Purchases
1 Red 1 1 15 50 120 25 5
2 Blue 0 1 20 70 130 50 10
3 Blue 1 0 10 35 120 55 5
4 Red 0 0 15 28 105 65 10
5 Yellow 1 0 15 80 140 80 15
I want to aggregate this table with conditional summations. For example: sum Length and Height, but only for Leap Years; sum Height and Cost, but only for Leap Years and Weekends.
And I want these conditional summations grouped by color to look like the below:
Color  Length Height Weight Cost Purchases Length_LeapYear Height_LeapYear Height_LeapYear_Weekend Cost_LeapYear_Weekend Purchases_Weekend
Red    30     78     225    90   15        15              50              50                      25                    5
Blue   30     105    250    105  15        20              70              0                       0                     5
Yellow 15     80     140    80   15        0               0               0                       0                     15
I am working in dplyr and have the following working to sum multiple fields on the same condition using summarise_at():
df %>%
group_by(Color, Weekend, LeapYear) %>%
summarise_at(c(Length_LeapYear == "Length", Height_LeapYear == "Height"), ~sum(.[LeapYear==1]))
But when I try to add conditions for my remaining conditionally summed variables, this removes my prior summarizations. Here is my idea for how I imagine the code to work.
df %>%
group_by(Color, Weekend, LeapYear) %>%
summarise_at(c("Length", "Height", "Weight", "Cost", "Purchases"), sum) %>%
summarise_at(c(Length_LeapYear == "Length", Height_LeapYear == "Height"), ~sum(.[LeapYear==1])) %>%
summarise_at(c(Height_LeapYear_Weekend == "Height", Cost_LeapYear_Weekend == "Cost"), ~sum(.[LeapYear==1 & Weekend ==1])) %>%
summarise(Purchases_Weekend = sum(Purchases)) %>%
group_by(Color)
Ultimately, I feel like there must be a way to get each of these differently conditioned summations into one call of summarise_at(). I am also unsure of the best practice for summing conditionally on columns (Weekend and LeapYear) and then omitting those columns from the final table, so help on that would be appreciated as well.
For the record, I do know that I can perform these manipulations with one long call to summarise(), where I individually condition each derived column.
However, in practice, my dataset is a lot wider than this, and it just makes more sense to try to condense the data manipulation by grouping like conditions.
UPDATE On second thought, I understand that you need to do it all at once. I think the syntax below will summarise the whole dataset (in the example, columns 3 to 7) by four types of aggregation at once:
df %>% group_by(Color) %>%
summarise(across(3:7, ~sum(.))) %>%
left_join(df %>% group_by(Color) %>% summarise(across(3:7, ~sum(.*LeapYear), .names= "{.col}_LeapYear"))) %>%
left_join(df %>% group_by(Color) %>% summarise(across(3:7, ~sum(.*Weekend), .names= "{.col}_Weekend"))) %>%
left_join(df %>% group_by(Color) %>% summarise(across(3:7, ~sum(.*LeapYear*Weekend), .names= "{.col}_LeapYear_Weekend")))
# A tibble: 3 x 21
Color Length Height Weight Cost Purchases Length_LeapYear Height_LeapYear Weight_LeapYear Cost_LeapYear
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 Blue 30 105 250 105 15 20 70 130 50
2 Red 30 78 225 90 15 15 50 120 25
3 Yell~ 15 80 140 80 15 0 0 0 0
# ... with 11 more variables: Purchases_LeapYear <int>, Length_Weekend <int>, Height_Weekend <int>,
# Weight_Weekend <int>, Cost_Weekend <int>, Purchases_Weekend <int>, Length_LeapYear_Weekend <int>,
# Height_LeapYear_Weekend <int>, Weight_LeapYear_Weekend <int>, Cost_LeapYear_Weekend <int>,
# Purchases_LeapYear_Weekend <int>
You can also pass complete functions in a list, like this (which will shorten your code further):
df %>% group_by(Color) %>%
summarise(across(3:7, list(sum= ~sum(.),
leapyear = ~sum(.*LeapYear),
weekend = ~sum(.*Weekend),
leapyear_weekend = ~sum(.*Weekend*LeapYear))))
# A tibble: 3 x 21
Color Length_sum Length_leapyear Length_weekend Length_leapyear~ Height_sum Height_leapyear Height_weekend
<chr> <int> <int> <int> <int> <int> <int> <int>
1 Blue 30 20 10 0 105 70 35
2 Red 30 15 15 15 78 50 50
3 Yell~ 15 0 15 0 80 0 80
# ... with 13 more variables: Height_leapyear_weekend <int>, Weight_sum <int>, Weight_leapyear <int>,
# Weight_weekend <int>, Weight_leapyear_weekend <int>, Cost_sum <int>, Cost_leapyear <int>,
# Cost_weekend <int>, Cost_leapyear_weekend <int>, Purchases_sum <int>, Purchases_leapyear <int>,
# Purchases_weekend <int>, Purchases_leapyear_weekend <int>
I have included a sample dput(df) in your question.
OLD ANSWER Do it like this
df %>%
group_by(Color) %>%
summarise(Length_s = sum(Length),
Height_s = sum(Height),
Weight_s = sum(Weight),
Cost_s = sum(Cost),
Purchases_s = sum(Purchases),
Length_Leap_year = sum(Length * LeapYear),
Height_Leap_year = sum(Height * LeapYear),
Height_Leap_year_Weekend = sum(Height * LeapYear * Weekend),
Purchases_Weekend = sum(Purchases * Weekend))
# A tibble: 3 x 10
Color Length_s Height_s Weight_s Cost_s Purchases_s Length_Leap_year Height_Leap_year Height_Leap_year_Weekend Purchases_Weeke~
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 Blue 30 105 250 105 15 20 70 0 5
2 Red 30 78 225 90 15 15 50 50 5
3 Yellow 15 80 140 80 15 0 0 0 15
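If only some column/condition pairs are needed, as in the question's expected table, one option is to describe them once and build the conditional columns in a loop before aggregating. The following is a base R sketch rather than a dplyr solution, and the objects spec, masks and wide are hypothetical names introduced for illustration.

```r
df <- data.frame(
  Color = c("Red", "Blue", "Blue", "Red", "Yellow"),
  Weekend = c(1L, 0L, 1L, 0L, 1L), LeapYear = c(1L, 1L, 0L, 0L, 0L),
  Length = c(15L, 20L, 10L, 15L, 15L), Height = c(50L, 70L, 35L, 28L, 80L),
  Weight = c(120L, 130L, 120L, 105L, 140L), Cost = c(25L, 50L, 55L, 65L, 80L),
  Purchases = c(5L, 10L, 5L, 10L, 15L)
)

# Which columns should be summed under which condition
spec <- list(
  LeapYear         = c("Length", "Height"),
  LeapYear_Weekend = c("Height", "Cost"),
  Weekend          = "Purchases"
)
# 0/1 masks for each condition (the product acts as a logical AND)
masks <- list(
  LeapYear         = df$LeapYear,
  LeapYear_Weekend = df$LeapYear * df$Weekend,
  Weekend          = df$Weekend
)

# Start from the unconditional columns, then add one masked copy per pair
wide <- df[c("Length", "Height", "Weight", "Cost", "Purchases")]
for (cond in names(spec)) {
  for (col in spec[[cond]]) {
    wide[[paste(col, cond, sep = "_")]] <- df[[col]] * masks[[cond]]
  }
}

# Sum every column within each Color; Weekend and LeapYear never enter the result
out <- rowsum(wide, group = df$Color)
out
```

Multiplying by a 0/1 mask is the same trick the answers above use (sum(. * LeapYear)), so it relies on Weekend and LeapYear being coded as 0/1.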
I have original temperature data in table1.txt with station number header which reads as
Date 101 102 103
1/1/2001 25 24 23
1/2/2001 23 20 15
1/3/2001 22 21 17
1/4/2001 21 27 18
1/5/2001 22 30 19
I have a lookup table file lookup.txt which reads as :
ID Station
1 101
2 103
3 102
4 101
5 102
Now, I want to create a new table (new.txt) with ID number header which should read as
Date 1 2 3 4 5
1/1/2001 25 23 24 25 24
1/2/2001 23 15 20 23 20
1/3/2001 22 17 21 22 21
1/4/2001 21 18 27 21 27
1/5/2001 22 19 30 22 30
Is there any way I can do this in R or MATLAB?
I came up with a solution using tidyverse. It involves some wide to long transformation, matching the data frames on Station, and then spreading the variables.
#Recreating the data
library(tidyverse)
df1 <- read_table("table1.txt")
lookup <- read_table("lookup.txt")
#Create the output
k1 <- df1 %>%
gather(Station, value, -Date) %>%
mutate(Station = as.numeric(Station)) %>%
inner_join(lookup) %>% select(-Station) %>%
spread(ID, value)
k1
We can use base R to do this. Create a column index by matching the 'Station' column with the names of the first dataset, use that to duplicate the columns of 'df1' and then change the column names with the 'ID' column of second dataset
i1 <- with(df2, match(Station, names(df1)[-1]))
dfN <- df1[c(1, i1 + 1)]
names(dfN)[-1] <- df2$ID
dfN
# Date 1 2 3 4 5
#1 1/1/2001 25 23 24 25 24
#2 1/2/2001 23 15 20 23 20
#3 1/3/2001 22 17 21 22 21
#4 1/4/2001 21 18 27 21 27
#5 1/5/2001 22 19 30 22 30
data
df1 <- structure(list(Date = c("1/1/2001", "1/2/2001", "1/3/2001", "1/4/2001",
"1/5/2001"), `101` = c(25L, 23L, 22L, 21L, 22L), `102` = c(24L,
20L, 21L, 27L, 30L), `103` = c(23L, 15L, 17L, 18L, 19L)),
class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(ID = 1:5, Station = c(101L, 103L, 102L, 101L,
102L)), class = "data.frame", row.names = c(NA, -5L))
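The base R answer starts from data frames that are already in memory. A sketch of the full round trip from files, assuming space-separated text files like those in the question; temporary files stand in for table1.txt, lookup.txt and new.txt so the snippet can run anywhere, and columns are selected by name rather than with match().

```r
# Temporary stand-ins for table1.txt, lookup.txt and new.txt
table1_file <- tempfile(fileext = ".txt")
lookup_file <- tempfile(fileext = ".txt")
new_file    <- tempfile(fileext = ".txt")

writeLines(c("Date 101 102 103",
             "1/1/2001 25 24 23",
             "1/2/2001 23 20 15",
             "1/3/2001 22 21 17",
             "1/4/2001 21 27 18",
             "1/5/2001 22 30 19"), table1_file)
writeLines(c("ID Station", "1 101", "2 103", "3 102", "4 101", "5 102"),
           lookup_file)

# check.names = FALSE keeps the station numbers as column names ("101", not "X101")
df1 <- read.table(table1_file, header = TRUE, check.names = FALSE,
                  stringsAsFactors = FALSE)
df2 <- read.table(lookup_file, header = TRUE)

# Pick the station column for each ID (repeats allowed), then rename to the IDs
dfN <- df1[c("Date", as.character(df2$Station))]
names(dfN)[-1] <- df2$ID

write.table(dfN, new_file, quote = FALSE, row.names = FALSE)
readLines(new_file)
```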
Here is an option with MATLAB:
T = readtable('table1.txt','FileType','text','ReadVariableNames',1);
L = readtable('lookup.txt','FileType','text','ReadVariableNames',1);
old_header = strcat('x',num2str(L.Station));
newT = array2table(zeros(height(T),height(L)+1),...
'VariableNames',[{'Date'} strcat('x',num2cell(num2str(L.ID)).')]);
newT.Date = T.Date;
for k = 1:size(old_header,1)
newT{:,k+1} = T.(old_header(k,:));
end
writetable(newT,'new.txt','Delimiter',' ')